EUROLAN 2019 Summer School

I started drafting this post on the last day of the EUROLAN 2019 Summer School, while the participants were taking the last Kahoot! quiz and filling in the feedback form.

This was the first summer school where I was on the other side — on the organizing committee — and I have to admit that I liked it very much.

I liked the joy and enthusiasm of the participants, who acted as if they had embarked on an adventure. In some sense, it was an adventure for them, which could be seen in the way they handled the exercises — each task was tackled with maximum involvement and seriousness by everyone.


The exercises were meant to cover a wide range of activities in computational linguistics — from manually annotating a small corpus of training phrases to training a machine learning model.

The mission of our micro-team (Tiberiu Boroș, Cristian Pădurariu and myself) was to teach the participants how to train a Machine Learning model using NLP-Cube, a Natural Language Processing framework to which Tiberiu is the main contributor.

Knowing that there would be participants on both Windows and macOS, we settled on packaging NLP-Cube into a Docker image which each participant would build in order to train the model.

Unfortunately, despite our best efforts and the great presentation made by Tiberiu, the results were disastrous — the laptops of over half of the participants didn't meet Docker's requirements, and of the remaining people many did not have enough disk space for the various transformations of the corpus and the additional utilities. Overall, only a few people were able to start the training process.

At the end of the school, the organizing committee set up a small contest for the participants: each team had to come up with an idea for an application and, after refining the idea, would get exactly five minutes to pitch it to an ad-hoc jury. The application had to be somewhat feasible to develop and to use as many of the topics taught during the summer school as possible.

The ideas were great; the one I liked the most was an application that would do speech recognition for people suffering from some form of severe speech impediment.

While part of the jury went to deliberate over which team should get which prize, Gabriela Haja gave the closing speech. It was a simple, elegant and beautiful talk in which she praised the participants' ideas but also advised them that they would need to work hard, on the ideas themselves and on their own skills, in order to see those ideas put into practice.

Overall it was a great experience and I'm grateful for the opportunity to participate. And now, with the school long gone, I'm reviewing and archiving the to-do list of activities for the summer school and building the list of…

Lessons learned

Infrastructure is a problem
We learned the hard way that not everyone has a high-end laptop that can run resource-heavy applications.
Murphy's law is still applicable
"Anything that can go wrong, will go wrong", Murphy's law states. And indeed, we had a participant who couldn't build the Docker image for some reason, but when we tried to load an already built image into Docker from a USB stick, the USB port failed.
Think and tailor your workshop for the audience
Being heavily involved in software development, Tiberiu and I made some wrong assumptions — that people would know how to install Docker on their machines, that the participants would know how to execute the command lines we provided, etc. The vast majority of the participants were linguists at their core, and tasks that we perceive as trivial are surely not trivial for them.
You learn a lot while teaching
This is something that cannot be overstated. I knew I would be learning a lot of new things from the collaboration with Tiberiu (I did not know NLP-Cube existed before this summer school), but to my great surprise I also learned a great deal just by being present and helping people.
Allow yourself to fail
The disappointing results of our endeavor to train a Machine Learning model had an impact on my morale, but discussing them with the more experienced members of the committee helped me accept that such failures are part of the process.

2018 year in review

Looking back at 2018 I can say for sure that it has been a busy and, thankfully, fruitful year for me; I have managed to juggle work and community tasks while also spending time with my family.

So, without further ado here are my biggest accomplishments in 2018:

Presentation on Big Data Analytics

Being known as a Big Data enthusiast at my job, I was asked if I could give a presentation on Big Data. I was more than happy to do so, and this way I got a glimpse of how it feels to give a lecture - the presentation was scheduled in one of the big lecture rooms at the Faculty of Economics and Business Administration; can't say I didn't like the feeling. The slides are available from my OneDrive account.

Artificial Neural Networks tutorial

A few weeks later I hosted a tutorial on Artificial Neural Networks at Alecu Russo State University in Bălți, Moldova. The tutorial was the first event of the International Conference on Mathematics, Informatics and Information Technologies and also my first academic talk.

PIN Magazine article about Artificial Intelligence

In June we celebrated one year since the Iași AI community was created with a cover story in PIN Magazine about Artificial Intelligence. I authored one of the cover-story articles, in which I argued that although Artificial Intelligence has had tremendous success lately, it is still at the beginning of its evolution, and we humans must set our expectations accordingly, while still expecting AI to change our lives. You can read the whole article here.

Iași AI TensorFlow workshops

At the end of August and the beginning of October I co-hosted two TensorFlow workshops, together with Ciprian Tălmăcel, organized by the Iași AI community. This was my debut as a presenter within the community and it proved to be a great experience and a success. I had the opportunity to learn a lot from Ciprian and to improve my skills through the feedback given by the attendees. You can find the Jupyter Notebooks for the workshops in the dedicated Github repository.

RNN presentation @ Iași AI

A month later, in November, came my second appearance within the Iași AI community, this time with a presentation about Recurrent Neural Networks. The main focus of the presentation was how simple the code can be if you possess the theoretical background, and how you can actually learn (to some extent) the theory just by looking at the implementation. I have published the slides, the code and the LaTeX sources for the presentation in a dedicated Github repository.

Other activity @ Iași AI

Gladly, my activity within the Iași AI community did not end with the RNN presentation; later on, I helped organize the AI opportunities panel, which was the last official event of the community for 2018. In parallel, we started pouring more effort into implementing an Open Data Hub initiative for Iași; this, however, is a project for the upcoming year(s).

Promoted to Technical Lead

Last but not least, I got promoted at work. In April I switched projects and took the switch as an opportunity to affirm myself. The efforts I've put into building solid and scalable products, and into improving existing ones, were noticed by the management team, and at the end of the year I was promoted to the role of Technical Lead.

It's already the start of 2019 and I'm excited about what's coming ahead - the Open Data Hub, the newly opened opportunities at work, the upcoming projects in academia and many more I don't know of yet. What I do know is that I will do my best to do even better in 2019.

English Romanian dictionary for Machine Learning

This post is an initiative to build a list of Romanian translations for Machine Learning terms.


Artificial Intelligence = Inteligență Artificială.

Biased = tendențios, subiectiv.

Eigenvalues (vectors) = valori (vectori) proprii.

Features = atribute, particularități, caracteristici, trăsături.

Gold corpus = corpus de referință.

Machine Learning = Învățare Automată.

Outliers = valori extreme.

Skewed = nesimetric.

Threshold = prag.

Toy problem = problemă didactică.


Big thanks to Gabriela Haja and Alex Moruz for reviewing this list.

Automating custom workflow in Emacs

Due to the lack of human resources in a research project I'm involved in, the team decided to take upon itself the semantic comparison of 3000 pairs of papers from the medical domain.

Each paper is a JSON file with the structured contents of the publication, like in the picture below:

Since we also wanted to do at least some cross-validation, we decided that each member of the team would compare 2000 pairs of publications, so that each pair would be compared by two out of the three persons on the team. So we split the 3000 publication pairs into 3 sets, which were saved into CSV files with the following structure: file1, file2, similarity_score, where file1 and file2 are the names of the files and similarity_score is to be filled in with the semantic similarity score.
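This kind of split can be sketched in a few lines of Python (an illustration only; the round-robin pairing scheme and the file names are assumptions, not necessarily what we actually did):

```python
from itertools import combinations

def split_pairs(pairs, n_annotators=3):
    """Assign each pair to exactly two annotators by cycling over the
    possible annotator couples, keeping the workloads balanced."""
    couples = list(combinations(range(n_annotators), 2))  # (0,1), (0,2), (1,2)
    sets = [[] for _ in range(n_annotators)]
    for i, pair in enumerate(pairs):
        first, second = couples[i % len(couples)]
        sets[first].append(pair)
        sets[second].append(pair)
    return sets

# 3000 pairs split among 3 annotators: 2000 pairs each,
# and every pair is scored by exactly two people.
sets = split_pairs([("file%d_a.json" % i, "file%d_b.json" % i)
                    for i in range(3000)])
```

Each of the three resulting sets can then be written out as one of the file1, file2, similarity_score CSV files.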

My first idea was to have Emacs split into three windows, allowing both the registering of the score and a side-by-side comparison of the files, as in the picture below:

For each pair of files from the left window I would:

  1. Copy the name of the first file using C-a C-SPC C-u 2 M-f M-w
  2. Go to the top window on the right using C-x o 2 (I'm using ace-window, thus after pressing C-x o I can select the window to switch to by pressing its number key)
  3. Open the file using C-x C-f then navigating to the directory and yanking the file name at the end
  4. Move back to the first window using C-x o 1
  5. Copy the name of the second file by first moving over the first comma with C-f then C-u 2 M-f M-w
  6. Go to the bottom window on the right using C-x o 3
  7. Repeat step 3
  8. Repeat step 4
  9. Compare publications and register score in similarity_score column

Although this workflow seems quite laborious due to the many steps, I got used to it quite rapidly (from the second pair onward) and managed to form some sort of temporary muscle memory for opening files, in the sense that I was doing it on full auto-pilot, without thinking that I was opening files.

However, there was a problem with this workflow: the directory containing the files to be compared holds around 100K such JSON files, and on my 5400 RPM hard disk it took what seemed like forever to scan the directory when pressing C-x C-f, all in order to give me the benefit of autocompletion, which I did not need because I already knew the file names. So basically, for each pair of publications, I was stuck waiting twice for the directory to be scanned.

And then I got an idea: I already knew the file names, and the directory containing them is always the same, so wouldn't it be faster to send the full path (obtained by concatenating the directory path and the file name) to the find-file function instead of waiting for a directory scan?

I switched to the *scratch* buffer and evaluated a find-file call with the full path of a file and of course, the file was loaded instantly.

So I decided to take it a bit further: can't I, while on a line in the left window, open the files on that line in the other two windows?

Of course I can, because Emacs is awesome. All I need to do is write some Lisp code that will do what I want.

First things first, I declared a (global) variable to hold the full path to the directory containing the files:

(defvar *compare-publications-dir* "/path/to/publications/" ; placeholder; set to the actual directory
  "The location of files to compare.")

Then I created a function to get the whole line from the csv file:

(defun get-current-line ()
  (beginning-of-visual-line)
  (let ((start (point)))
    (end-of-visual-line)
    (buffer-substring-no-properties start (point))))

This function moves the point to the beginning of the line by calling (beginning-of-visual-line), then saves the position in a local variable start and moves to the end of the line via a call to (end-of-visual-line). In the end, it returns the substring between the start position and the position returned by (point).

Having the line of text, I need to: a) split the line by the comma; b) store the first and second parts of the line into two variables called file-1 and file-2; c) move to the top-right window; d) concatenate the values of *compare-publications-dir* and file-1 and pass the result to (find-file-read-only) (I don't want to accidentally change the files being compared); e) move to the bottom-right window; f) repeat d) with file-2 instead of file-1; g) return to the left window.

This was also implemented with a function which can be called interactively via M-x:

(defun compare-publications ()
  (interactive)
  (let* ((files (csv-split-string (get-current-line) ","))
         (file-1 (car files))
         (file-2 (car (cdr files))))
    (other-window 1)
    (find-file-read-only (concat *compare-publications-dir* file-1))
    (other-window 1)
    (find-file-read-only (concat *compare-publications-dir* file-2))
    (other-window 1)))

And that's it. After evaluating the functions above, I have an automated workflow. Now, to compare two files, I just navigate to a line in the CSV file and type M-x compare-publications. The only thing left to do manually (besides the comparison itself) is to type in the similarity_score for the files on that line.

ServiceActivationException when auto-starting WCF services with AutofacServiceHostFactory

I switched teams at work, and as a welcome gift in the new team I got to investigate the following error:

Exception: System.ServiceModel.ServiceActivationException: The service '/AuthorisationService.svc' cannot be activated due to an exception during compilation. The exception message is: The AutofacServiceHost.Container static property must be set before services can be instantiated.. —> System.InvalidOperationException: The AutofacServiceHost.Container static property must be set before services can be instantiated. at Autofac.Integration.Wcf.AutofacHostFactory.CreateServiceHost(String constructorString, Uri[] baseAddresses) at System.ServiceModel.ServiceHostingEnvironment.HostingManager.CreateService(String normalizedVirtualPath, EventTraceActivity eventTraceActivity) at System.ServiceModel.ServiceHostingEnvironment.HostingManager.ActivateService(ServiceActivationInfo serviceActivationInfo, EventTraceActivity eventTraceActivity) at System.ServiceModel.ServiceHostingEnvironment.HostingManager.EnsureServiceAvailable(String normalizedVirtualPath, EventTraceActivity eventTraceActivity) — End of inner exception stack trace — at System.ServiceModel.ServiceHostingEnvironment.HostingManager.EnsureServiceAvailable(String normalizedVirtualPath, EventTraceActivity eventTraceActivity) at System.ServiceModel.ServiceHostingEnvironment.EnsureServiceAvailableFast(String relativeVirtualPath, EventTraceActivity eventTraceActivity) Process Name: w3wp Process ID: 9776

The troublesome service is hosted in an ASP.NET web application and is preloaded with a custom implementation of IProcessHostPreloadClient, which more or less does what's described in this blog post. Since the project hosting the service uses Autofac as its DI framework, the service is set up to use AutofacServiceHostFactory as the service factory:

<%@ ServiceHost
  Service="AuthorizationServiceImpl, AuthorizationService"
  Factory="Autofac.Integration.Wcf.AutofacServiceHostFactory, Autofac.Integration.Wcf" %>

After some googling for the error, I ended up on the Autofac documentation page, where I got the first idea of what was happening:

When hosting WCF Services in WAS (Windows Activation Service), you are not given an opportunity to build your container in the ApplicationStart event defined in your Global.asax because WAS doesn’t use the standard ASP.NET pipeline.

Ok, great! Now I know that the ServiceHostingEnvironment.EnsureServiceAvailable() method (which is called to activate the service) doesn't go through the standard ASP.NET pipeline. A solution to this issue is in the next paragraph of the documentation:

The alternative approach is to place a code file in your App_Code folder that contains a type with a public static void AppInitialize() method.

And that's what I did. I went to the project in Visual Studio, added the special ASP.NET folder named App_Code and added to it a class named AppStart with a single method, public static void AppInitialize(), which contained all the required bootstrapping logic for Autofac. I redeployed the application, but the error kept popping up, and only after carefully reading the comments on this StackOverflow answer and this blog post on how WCF services are initialized did I find out why the AppInitialize method wasn't being invoked: the build action of AppStart.cs needs to be Content, not Compile.

So, when getting a ServiceActivationException with the error message The AutofacServiceHost.Container static property must be set before services can be instantiated, make sure you have the following:

  1. The special ASP.NET folder App_Code
  2. A class in App_Code having a method with this signature public static void AppInitialize() which contains all the required initialization code
  3. The build action of the file containing the above method is set to Content as shown in the picture below


MediatR - Handler not found error when the DataContext couldn't be initialized


If you use the MediatR package and it suddenly fails with Handler was not found for request of type <type>, inspect the dependencies of the handler it fails to create/invoke. One or more of those dependencies (a DbContext in my case) may throw an error when instantiated, and that error makes MediatR fail.

Jimmy Bogard's MediatR is a little gem of a package. I like using it because it enables a good separation of business logic from boilerplate code and provides a clean, structured enforcement of the Single Responsibility Principle.

I use this package extensively in one of my outside-work projects (I'm proud to say that it's not a pet project anymore) to delegate requests/commands to their respective handlers. The project itself consists of two parts: an ASP.NET MVC application for public access and back-office management, and a WebAPI application used for registering payments. In order to keep both Web Application and Web API URLs consistent (and pretty), I have hosted the Web API application as a virtual directory inside the main Web Application.

Recently, after an update of the application, the payment module went down (giving me a tiny heart attack). As expected, I dove into the application logs and, after some thorough searching, found the culprit in the following error message:

An unhandled exception of type 'System.InvalidOperationException' occurred in MediatR.dll Additional information: Handler was not found for request of type GetAuthorizedUserRequest. Container or service locator not configured properly or handlers not registered with your container.

The exception was popping up inside the IsAuthorized method of a custom AuthorizeAttribute:

protected override bool IsAuthorized(HttpActionContext actionContext)
{
    try
    {
        var authorizationToken = new AuthorizationToken(actionContext.Request);
        if (String.IsNullOrEmpty(authorizationToken.Value))
            return false;

        var request = new GetAuthorizedUserRequest
        {
            AuthorizationToken = authorizationToken.Value
        };
        var user = _securityService.GetAuthorizedUser(request);
        return user != null;
    }
    catch (Exception)
    {
        return false;
    }
}

The first thing to do was to thoroughly inspect what the IoC container (StructureMap in my case) had registered. After a glimpse through the output of the WhatDoIHave() method, I saw that the handler GetAuthorizedUserRequestHandler was indeed registered as an IRequestHandler<GetAuthorizedUserRequest, GetAuthorizedUserResponse>.

So, what was the problem then? The InnerException property of the caught exception was null, and I was stuck.

Then, in a flash of divine inspiration, I decided to comment out the existing constructor of the request handler and create a default one (also returning a dummy user). It worked: the exception wasn't thrown and the user got authenticated.

However, the next request (dispatched through MediatR) that had to query the database failed, which gave me the idea that there must be some issue with the DbContext initialization (I use Entity Framework). Sure enough, when I put a breakpoint in the constructor of my DataContext class (derived from DbContext), I got an exception saying that the key "mssqllocaldb" is missing from the <connectionStrings> section.

Then I remembered that the latest code update also came with an update of the Entity Framework NuGet package, and it dawned upon me why MediatR was failing. As I said in the beginning, the Web API application is hosted under the main Web Application. This means that the <entityFramework> configuration element in the child application is inherited from the parent one, so the Web.config file of the child application did not contain any section related to Entity Framework. When I upgraded the NuGet package, the installer added the configuration section with default values. Those default (and wrong) values were read by the DbContext constructor, which consequently failed. After deleting the <entityFramework> configuration element, the application went back online.

The common pitfalls of ORM frameworks - RBAR

ORM frameworks are a great tool, especially for junior developers, because they blur the line between the application logic and the data it crunches. Except that this line-blurring advantage may become a real production issue if it is not taken into consideration when writing the code.

Let us consider an example. Let's suppose we're working on a (what else?) e-commerce platform. Somewhere in the depths of that platform there is an IOrderService which exposes the following method:

public interface IOrderService
{
    void PlaceOrder(Guid customerId, IEnumerable<OrderItem> orderItems);
}

where OrderItem holds the data about an ordered item.

public class OrderItem
{
    public Guid ItemId { get; set; }

    public int Quantity { get; set; }
}

The PlaceOrder method needs to:

  • Lookup the Customer in the database
  • Create a new CustomerOrder instance
  • Add each Item to the order and decrease stock count
  • Save the CustomerOrder in the database

Of course, since we're using an ORM framework, the classes used by the repositories - Customer, CustomerOrder and Item - are mapped to database tables.

Given the above, someone would be tempted to implement the PlaceOrder method like this:

public void PlaceOrder(Guid customerId, IEnumerable<OrderItem> orderItems)
{
    var customer = _customerRepository.Get(customerId);
    var order = new CustomerOrder(customer);

    foreach (var orderedItem in orderItems)
    {
        var item = _itemRepository.Get(orderedItem.ItemId); // one SELECT per ordered item
        order.AddItem(item, orderedItem.Quantity);          // also decreases the stock count
    }
    _orderRepository.Save(order);
}


And why wouldn't they? It seems the most straightforward transposition of the requirements defined above. The code behaves as expected in both the Dev and QA environments and is afterwards promoted to production, where a database with hundreds of thousands of rows in the Items table awaits. There, too, the behavior is as expected, until one day an eager customer wants to buy 980 distinct items (because why not?).

What happens with the code above? The code itself still works, but the database command times out and the customer cannot place their significant order.

So what is the problem? Why does it time out? Well, because the aforementioned line between application logic and database is blurred enough for the iterative paradigm to creep into the set-based one. In the SQL community this paradigm creep has a name: Row By Agonizing Row (RBAR).

To put it in the context of the example above - it takes more time to do 980 pairs of SELECT and UPDATE operations than to do one SELECT which returns 980 rows followed by one UPDATE which alters 980 rows.
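The two shapes can be sketched with an in-memory SQLite database (a toy illustration only; the table layout and the numbers are made up, and on a local in-memory database the gap is far smaller than over a network, where every statement also pays a round-trip cost):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Items (Id integer primary key, Stock integer)")
conn.executemany("insert into Items values (?, ?)", [(i, 500) for i in range(1000)])

ordered = [(item_id, 2) for item_id in range(980)]  # (ItemId, Quantity) pairs

# RBAR shape: one SELECT and one UPDATE per ordered item -> 1960 statements.
for item_id, quantity in ordered:
    conn.execute("select Stock from Items where Id = ?", (item_id,)).fetchone()
    conn.execute("update Items set Stock = Stock - ? where Id = ?",
                 (quantity, item_id))

# Set-based shape: one SELECT and one UPDATE covering all 980 rows.
ids = [item_id for item_id, _ in ordered]
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"select Id, Stock from Items where Id in ({placeholders})", ids).fetchall()
conn.execute(
    f"update Items set Stock = Stock - 2 where Id in ({placeholders})", ids)
```

The set-based UPDATE hardcodes the quantity only because all quantities are equal in this sketch; real code would join against a table of (ItemId, Quantity) pairs, which is exactly what a table-valued parameter enables.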

So, let's switch the paradigm and start working with collections in our code, to minimize the number of operations in the database. The first thing to do is to load all the items in bulk instead of loading them one by one. This will reduce the number of SELECT operations from 980 to 1, halving the total number of operations. We still need to update the stock count for each item individually, because the ORM framework doesn't know how to translate the changes for each item into a single UPDATE statement, but considering that we've halved the total number of operations, let's give this approach a try, shall we?

public void PlaceOrder(Guid customerId, IEnumerable<OrderItem> orderItems)
{
    var customer = _customerRepository.Get(customerId);
    var customerOrder = new CustomerOrder(customer);

    // a single SELECT which loads all the ordered items
    var items = _itemRepository.Items
        .Join(orderItems,
              item => item.Id,
              orderedItem => orderedItem.ItemId,
              (item, _) => item)
        .ToDictionary(i => i.Id);

    foreach (var orderedItem in orderItems)
    {
        var item = items[orderedItem.ItemId];
        customerOrder.AddItem(item, orderedItem.Quantity); // still one UPDATE per item
    }
    _orderRepository.Save(customerOrder);
}


This will solve the timeout problem but will create another one: useless load on the system. The code loads 980 rows from the database but uses only two attributes of each row, Id and Barcode. We might say that this can be solved by projecting an Item into a tuple of <Barcode, Id>, but that would be a partial solution, because a request of 10k items could still place a great burden on system memory. Also, there are still 980 UPDATE statements to be executed, which is still a lot.

The best approach to this is to not load any data at all from the database and to do the processing as close to the actual data as possible. And how can we do that? Exactly - with stored procedures.

create procedure CreateCustomerOrder(
	@customerId uniqueidentifier,
	@orderItems udttorderitems readonly)
as
begin
    set nocount on

    update sc
    set sc.Count = sc.Count - o.Quantity
    from StockCounts sc
    join Items i on sc.ItemId = i.Id
    join @orderItems o on i.Id = o.ItemId

    insert into CustomerOrder(CustomerId, OrderDateTime)
    values (@customerId, GetDate())

    insert into OrderLines(OrderId, ItemId, Quantity)
    select scope_identity(), i.Id, o.Quantity
    from Items i
    join @orderItems o on o.ItemId = i.Id
end

Now, of course, in real-life situations there won't be a customer who orders almost 1000 items in a single order, and the second approach (bulk-load the items and iterate the collection) will do just fine. The important thing to keep in mind in such cases is the need to switch from a procedural mindset to a set-based one, thus pruning this phenomenon of paradigm creep before it becomes full-blown RBAR processing.

Python development using Emacs from terminal

A few weeks ago, while working on a hackathon project I found myself very disappointed with my progress.

I had the impression that I could do better but something was holding me back, and then I realized that I was too distracted by Alt-Tab-ing through all the open applications, iterating through dozens of open tabs in the browser and spending too much time on websites that were of no use at that moment.

At that moment, on a whim, I decided to try to eliminate all of these distractions the hard way - by not using the X server at all (I was working on Kubuntu).

Since I was mainly working with Python code, and occasionally needed to open some file for inspection - all of which I did from Emacs - I said to myself:

Emacs can be opened from console so why not start working on hackathon from console?

Said and done. Ctrl-Alt-F1 and I was prompted with the TTY cursor. I logged in, started Emacs, opened all the required files and started working. All good, until I found myself needing to look something up on the Internet. I knew I could use eww as a web browser, so naturally I did so (yeah, I'm one of those people who use Bing instead of Google):

M-x eww
Enter URL or keywords:

And nothing… Oh, wait - I remember needing to enter a username and password when connecting to the Wi-Fi, but I wasn't prompted for those after logging into the terminal. How do I connect to the network?

As there was no way for me to find that out without using some sort of GUI (I'm not that good with terminals), I started a new X session, connected to the Wi-Fi from there and found this StackOverflow answer. Then I logged back into the terminal and started eshell from Emacs with M-x eshell. From there, I issued the following command:

nmcli c up <wi-fi-name>

which connected me to the Wi-Fi network.

Note: I got connected because in previous sessions I had opted to store the credentials for the connection; to get a prompt for the Wi-Fi username and password, use the --ask parameter like this:

nmcli --ask c up <wi-fi-name>

After connecting, I resumed my coding, and only at the end of the hackathon did I stop to ponder my experience; it wasn't as smooth as I had expected. Although I consider it a big plus that I was able to issue shell commands within Emacs through eshell, there were some hiccups along the way.

The first thing I noticed is that in the terminal not all the shortcuts I'm used to are available. Namely, in org-mode the M-right combination, which is used for indentation, moving columns within a table and demoting list items, is not available; instead I had to either use the C-c C-x r shortcut or explicitly invoke the command using M-x org-metaright. Although I did not invoke this command frequently, without the shortcut I felt pulled out of the flow each time I had to use an alternative way of invoking it.

The second, and by far the biggest, nuisance was the lack of a proper web-browsing experience. Although I most frequently landed on StackOverflow pages, and although eww rendered them pretty well (see the image below), the lack of the visual experience I was used to gave me a sense of discomfort.

However, when I analyzed how much I had accomplished while working from the terminal, I was simply amazed. Having no distractions and no meaningless motions like cycling through windows and tabs had a huge impact on my productivity. I was able to fully concentrate and immerse myself in the code, and by doing so I got a lot of work done.

Rename multiple files with Emacs dired

While adding text files from a folder to a project file, I noticed that the files in the folder lacked naming consistency. Namely, there were files with the .txt extension and files without any extension, as shown in the image below:

Since there were about 100 files without an extension, I started asking myself: is there a way to add the .txt extension to those files without manually renaming each one?

Of course there is. Here's what I did using Emacs and dired:

  • M-x dired to the desired directory (obviously)
  • In the dired buffer, enter the edit mode with C-x C-q
  • Go to the last file that has an extension, just before the block of files without extensions.
  • Starting from that file, place a mark and select the whole block of files without extensions (the selection should include the last file with an extension).
  • Narrow to the selected region using M-x narrow-to-region or C-x n n. The buffer should look like the image below:
  • Move to the beginning of the buffer using M-<
  • Start defining a new keyboard macro using C-x (
    • Move to the next line using C-n
    • Navigate to the end of the line using C-e
    • Add the .txt extension
  • End the macro definition with C-x )
  • Now that I have a macro that adds the .txt extension to a file name, I just need to run it as many times as there are extensionless files (100 in my case). To do so, just press C-u 100 F4. This will repeat the macro 100 times.
  • Once all the files are renamed, exit the narrowed region using M-x widen or C-x n w
  • Save the changes with C-c C-c

That's it!
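As an aside, the same bulk rename could also be scripted; here is a minimal Python sketch (the directory path would be whatever folder holds the files):

```python
from pathlib import Path

def add_txt_extension(directory):
    """Rename every extensionless regular file in `directory` to `<name>.txt`."""
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix == "":
            target = path.with_suffix(".txt")
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

The dired approach still wins for me, though: the pending renames stay visible in the buffer, and nothing touches the disk until C-c C-c.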

Managing bibliography using Emacs Org-Mode and Org-Ref

Since I've started using Emacs more and more, I began wondering whether I could use org-mode to keep a reading list/bibliography.

A quick search led me to this blog post where the author presents his setup for the same thing. However, after reading the post I saw that the author uses a combination of tasks and a reading list, which requires custom code and is too complex for my needs.

All I want is a simple list that:

  • is available on multiple workstations
  • can be built and managed with off-the-shelf components and without much effort
  • allows me to change the status of an entry.

I did, however, like the idea of using references to the papers being read, and since I recently saw a YouTube video presenting org-ref, I thought I should give it a try.

To handle the availability part I decided to use Dropbox, which is also what org-ref suggests.

Setup org-ref

org-ref is available on MELPA, so to install it just type M-x package-install RET org-ref RET. Afterwards, copy the code below into your init file and adjust the paths:

(setq reftex-default-bibliography '("~/Dropbox/bibliography/references.bib"))
;; see org-ref for use of these variables
(setq org-ref-bibliography-notes "~/Dropbox/bibliography/"
      org-ref-default-bibliography '("~/Dropbox/bibliography/references.bib")
      org-ref-pdf-directory "~/Dropbox/bibliography/bibtex-pdfs/")

(setq bibtex-completion-bibliography "~/Dropbox/bibliography/references.bib"
      bibtex-completion-library-path "~/Dropbox/bibliography/bibtex-pdfs"
      bibtex-completion-notes-path "~/Dropbox/bibliography/helm-bibtex-notes")
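
One caveat: if you are starting from scratch, the directories referenced above won't exist yet, so it may help to create them up front. A minimal sketch, assuming the same Dropbox paths:

```elisp
;; Create the bibliography folders used above if they don't exist yet.
(dolist (dir '("~/Dropbox/bibliography/bibtex-pdfs"
               "~/Dropbox/bibliography/helm-bibtex-notes"))
  (make-directory dir t))
```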

Creating the reading list

With org-ref in place, it was time to set up the reading list, so I created a new file under ~/Dropbox/bibliography/ with the following header:

#+TITLE: Reading list
#+STATUS: "Maybe" "Pending" "Reading" "Finished" ""
#+COLUMNS: %120ITEM %STATUS

The first line obviously defines the title of the document. The second line defines the allowed values for the Status property, where:

  • Maybe - reading the entry is optional
  • Pending - the entry will be read sometime after I finish the item I'm currently reading
  • Reading - the item currently being read
  • Finished - entries that have already been read.

Adding an entry to the list

  • Add a BibTeX entry to the references.bib file, e.g.:

    @inproceedings{le2014distributed,
      title={Distributed representations of sentences and documents},
      author={Le, Quoc and Mikolov, Tomas},
      booktitle={Proceedings of the 31st International Conference on Machine Learning (ICML-14)},
      year={2014}
    }
  • In the reading list file, add the title as a new heading using M-RET
  • Add Status and Source properties
    • With the cursor on the header:
      • Press C-c C-x p
      • Select or write Status
      • Press return
      • Select the value for status (e.g. Pending)
      • Press return
    • With the cursor on the header:
      • Press C-c C-x p
      • Write or select Source
      • Press return
      • If you know the citation key (le2014distributed in the example above) then you can write directly cite:le2014distributed; otherwise, leave the value for Source empty and put the cursor after the property declaration. Then, press C-c ] and select the entry from the reference list.

Repeat the steps above and you should end up with a list like this:
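
For reference, a single entry ends up looking roughly like this in the org file (a sketch, using the example citation key from above):

```org
* Distributed representations of sentences and documents
  :PROPERTIES:
  :Status:   Pending
  :Source:   cite:le2014distributed
  :END:
```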

Change the status of an entry

To change the status of an entry:

  • Navigate to the desired entry
  • Repeat the steps above for setting the Status property and select the new value

Status overview

After creating the list you may want an overview of the status of each entry. This can be achieved using Org Column View. The setup for column view is in the third line of the header

#+COLUMNS: %120ITEM %STATUS

which tells org-mode how to display the entries. Namely, we're defining two columns:

  1. Item, which will display the heading in 120 characters, and
  2. Status, which will take as much space as needed to display the status

Switching to column view

To switch to column view, place the cursor outside the headings and press C-c C-x C-c (or M-x org-columns). The list should look like the image below. If your cursor is on a heading when you press C-c C-x C-c (invoking org-columns), the column view will be activated only for that heading.

Exiting column view

To exit column view, position the cursor on a heading that is currently in column view and press q.

That's it. Happy reading!