The Role of Machine Learning in Legal Discovery

Recently Lofty Labs was engaged by a law firm.

Now, that's not the type of business we generally target as clients for a data analysis consultancy.  It turned out they had a data problem, though.  Several terabytes of it, in fact.

When two large companies sue each other, a lot of historical communication records get exchanged between the two sides, facilitated by the court, in a process known within the field as "discovery".  This is the process you probably think of from the movies, where lawyers can be seen carting dollies full of archival boxes overflowing with paper documents.  The process hasn't changed much except that today all of those documents are stored in an electronic format.

Within the field, the specific processing and handling of these digital documents is referred to as eDiscovery or simply "ESI" (for Electronically Stored Information).

We built eDiscovery software around several cases for our client.  Searching through several million emails and PDF documents looking for combinations of custodians and specific key phrases is really difficult for humans to do.  It's pretty easy for a cluster of Elasticsearch nodes though, so with a little bit of thought given to an interface and a whole lot of data scrubbing, our clients were soon digging through their dataset at a substantial clip.

As the engagement has tapered off we've started to turn our attention at Lofty Labs to what the future of eDiscovery looks like in the age of data science.  

The entire process, in its current state, is geared heavily towards people (litigation analysts) using their monkey brains to instruct software how to find relevant documents, using filters like:  text that matches the exact phrases they are looking for, in the time periods they wish to target, between the two custodians they believe to have been involved in the exchange, and so on.  Then analysts exhaustively explore all of the combinations therein to build evidence for their case.
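To make that kind of filter concrete, here is a minimal sketch of how such a search might be expressed as an Elasticsearch query body.  It is not the query our software actually ran: the field names ("body", "from", "to", "sent_at"), the helper name, and the sample addresses are all assumptions for illustration, and the real mapping would depend on how the documents were indexed.

```python
from datetime import date


def build_discovery_query(phrase, sender, recipient, start, end):
    """Build an Elasticsearch bool query body as a plain dict.

    Field names ("body", "from", "to", "sent_at") are hypothetical and
    would need to match the index mapping actually in use.
    """
    return {
        "query": {
            "bool": {
                # Exact phrase the analyst is looking for.
                "must": [{"match_phrase": {"body": phrase}}],
                # Non-scoring filters: the two custodians and the time period.
                "filter": [
                    {"term": {"from": sender}},
                    {"term": {"to": recipient}},
                    {
                        "range": {
                            "sent_at": {
                                "gte": start.isoformat(),
                                "lte": end.isoformat(),
                            }
                        }
                    },
                ],
            }
        }
    }


# Hypothetical example values.
query = build_discovery_query(
    "pricing agreement",
    "custodian.a@example.com",
    "custodian.b@example.com",
    date(2012, 1, 1),
    date(2012, 12, 31),
)
```

Every combination an analyst wants to explore means another call like this one with different arguments, which is exactly the manual loop described above.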

Any data science enthusiasts might wince at the notion of this being performed in such a manual way, but when you note that we're talking about an industry that still expects to bill and be billed photocopy fees, it becomes apparent that this field hasn't shaken its paper-driven history.

Applications for Machine Learning

Primarily, eDiscovery involves working with large corpora of human-written text, which means that many principles from the field of natural language processing (NLP) apply to these cases.

Sentiment Analysis

Typically applied to content generated on social media platforms and review systems, sentiment analysis is the application of NLP techniques and computational linguistics to derive emotional attributes from text content.  This can be done in several ways:

  • Polarity analysis:  Is the overall language indicative of being positive, negative, or neutral?
  • Subjectivity/objectivity analysis:  Removing objective sentences from consideration can often improve results in sentiment analysis problems.
  • Feature or Aspect analysis:  Identification of certain features within a text that can be measured for sentiment independently.  This is useful in longer texts where multiple topics are discussed.

As techniques for sentiment analysis and computational linguistics continue to improve, they will offer serious benefits to eDiscovery.  I imagine a certain point where discovery software might allow an analyst to query for:

     "documents between Custodian A and Custodian B, where Custodian A is angry",

or some other decipherable emotion that tells a story that you'd never find searching for a specific phrase.
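As a rough illustration of the idea, here is a toy polarity scorer combined with a custodian filter.  Treat it as a sketch only: the word lists are made up, strongly negative polarity is a crude stand-in for "angry", and the message fields ("from", "to", "body") are assumed names.  A real system would use a trained model rather than a handful of keywords.

```python
import re

# Toy lexicons, purely illustrative; not a real sentiment resource.
NEGATIVE = {"angry", "furious", "unacceptable", "outraged", "disappointed"}
POSITIVE = {"thanks", "great", "pleased", "happy", "appreciate"}


def polarity(text):
    """Return a score in [-1, 1]; below zero means more negative words than positive."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def angry_messages(messages, sender, recipient, threshold=-0.5):
    """Messages from sender to recipient whose polarity is at or below threshold."""
    return [
        m
        for m in messages
        if m["from"] == sender
        and m["to"] == recipient
        and polarity(m["body"]) <= threshold
    ]


# Invented sample messages.
messages = [
    {"from": "A", "to": "B", "body": "This delay is unacceptable and I am furious."},
    {"from": "A", "to": "B", "body": "Thanks, great work on the report."},
    {"from": "B", "to": "A", "body": "I am disappointed too."},
]
hits = angry_messages(messages, "A", "B")
```

Here only the first message comes back: it is the only one sent from A to B with a strongly negative score.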

Topic Segmentation

Topic segmentation algorithms can analyze text and divide it into distinct topics.  Because documents in eDiscovery can be quite lengthy (long emails, or dozens of pages in a report document) they can cover a multitude of topics that might not be indicated in data as terse as an email subject line.

The number of topics identified in an entire discovery corpus might be massive, so by itself this might not be exceptionally useful.  Imagine, though, a legal analyst finding a relevant case document, noting the topics discussed in that document, and then being able to surface similar documents based on topical similarity identified by software.
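The "surface similar documents" step can be sketched without any topic model at all.  The example below swaps in plain bag-of-words cosine similarity as a stand-in for topical similarity; it is not a topic segmentation algorithm, and the function names and sample documents are invented for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased word counts for a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(seed, corpus, top_n=2):
    """Rank corpus documents by similarity to the seed document."""
    seed_vec = term_vector(seed)
    scored = [(cosine(seed_vec, term_vector(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]


# Invented sample corpus.
corpus = [
    "Quarterly pricing schedule for the supplier contract",
    "Lunch plans for Friday",
    "Revised supplier contract pricing terms",
]
ranked = most_similar("supplier contract pricing", corpus)
```

With a real topic model, the term vectors would be replaced by per-document topic weights, but the ranking step would look much the same.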

Named Entity Recognition

The process of identifying proper nouns (people and places, for example) in text is Named Entity Recognition.  The value of this NLP technique should be pretty obvious.  Current technology allows litigation analysts to search for emails and documents exchanged between two or more parties (from Custodian A, to Custodian B, cc Custodian C, etc.), but not for documents in which a custodian is simply named.

Further, if two custodians of high concern in a case tend to consistently name the same person or place in their correspondence, perhaps there is a third custodian whose documents are of concern and should be subpoenaed through the court.
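Here is a small sketch of that "who do they both keep mentioning?" check.  It assumes the entity extraction has already happened upstream, so each document arrives with a hypothetical "entities" list alongside its "author"; the function name and the sample names are invented.

```python
from collections import Counter


def shared_mentions(docs, custodian_a, custodian_b, known_custodians):
    """Entities named by both custodians that are not already custodians in the case.

    Returns (name, total_mentions) pairs, most-mentioned first.
    """
    counts = {custodian_a: Counter(), custodian_b: Counter()}
    for doc in docs:
        if doc["author"] in counts:
            counts[doc["author"]].update(doc["entities"])

    shared = set(counts[custodian_a]) & set(counts[custodian_b])
    candidates = {
        name: counts[custodian_a][name] + counts[custodian_b][name]
        for name in shared
        if name not in known_custodians
    }
    return sorted(candidates.items(), key=lambda item: (-item[1], item[0]))


# Invented sample data; "entities" would come from an NER step not shown here.
docs = [
    {"author": "Custodian A", "entities": ["Dana Reyes", "Tulsa"]},
    {"author": "Custodian B", "entities": ["Dana Reyes", "Custodian A"]},
    {"author": "Custodian A", "entities": ["Dana Reyes", "Custodian B"]},
    {"author": "Custodian C", "entities": ["Tulsa"]},
]
leads = shared_mentions(
    docs,
    "Custodian A",
    "Custodian B",
    {"Custodian A", "Custodian B", "Custodian C"},
)
```

In this made-up data the only name both custodians mention, and who isn't already part of the case, is the one an analyst might want to look into next.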

Automatic Summarization

Again, the volume of text in discovery documents can be arbitrarily large.  Well-tuned summarization algorithms (which, currently, need to be trained on topically similar types of documents) can save analysts a tremendous amount of time when scanning through text for relevancy.  These automatic summaries could prove very useful in an interface that exposes other tools to complement the analyst's workflow.


The major caveat to applying AI and ML techniques to discovery comes on the production side, which we haven't discussed much yet.

There are two sides to the discovery process: production and analysis (sometimes referred to as tagging).  The same software can play a role on both sides of eDiscovery, but production is about reducing data up front.

When a firm receives a "production" of documents, they need to analyze what they deem relevant to build a case.  The firm that handed over, or "produced", those documents had similar needs, but their tools are geared towards finding the relevant information which they are legally obligated to produce, and producing that and nothing more.

The litigator who produced a document set has legal liability:  did they produce all of the materials which they were ordered to produce?  

We found this out when we tailored the tools built for our client to produce documents as well as analyze them.  Once our part was done, our consultants signed affidavits that were, essentially, testimony of exactly how our software performed each step of filtering the document set to end up with the production that was delivered.

I believe that the major players in eDiscovery (which are old, clunky, dumb, and very, very expensive) have the market somewhat cornered in this area, because their software has stood up in court for a decade.  We feel confident going into a courtroom to defend the software we built because it is still simple enough in design to convince a judge of its validity.  We took all emails from the custodian, we searched for the key phrases specified in court documents, and returned all matches in the production.  Convincing the court of artificial intelligence algorithms surfacing documents based on a seemingly magic set of very small numerical coefficients will be more challenging, to say the least.
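To show how little there is to explain in that kind of pipeline, here is a minimal sketch of a two-step production filter that records what each step did.  This is an illustration, not our client's software: the record fields ("custodian", "body"), the function name, and the sample emails are all assumptions.

```python
def produce(emails, custodian, key_phrases):
    """Return (production, audit_log) for one custodian and a list of key phrases."""
    audit_log = []

    # Step 1: take every email belonging to the custodian.
    from_custodian = [e for e in emails if e["custodian"] == custodian]
    audit_log.append(
        f"custodian={custodian}: kept {len(from_custodian)} of {len(emails)}"
    )

    # Step 2: keep emails containing any key phrase (case-insensitive substring match).
    phrases = [p.lower() for p in key_phrases]
    production = [
        e for e in from_custodian if any(p in e["body"].lower() for p in phrases)
    ]
    audit_log.append(
        f"key phrases={sorted(phrases)}: kept {len(production)} of {len(from_custodian)}"
    )
    return production, audit_log


# Invented sample emails.
emails = [
    {"custodian": "A", "body": "Details of the Rebate Program are attached."},
    {"custodian": "A", "body": "Are we still on for lunch?"},
    {"custodian": "B", "body": "The rebate program starts in June."},
]
production, audit_log = produce(emails, "A", ["rebate program"])
```

Each log line is a plain statement of a filter and its effect, which is the kind of thing that can be written into an affidavit.  A learned relevance score has no equally simple account of why a document was or wasn't included.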

So, the biggest hurdle to applying AI and ML in eDiscovery is likely one of legal precedent.  That precedent isn't going to exist until someone builds the tools and tests them in the courtroom.

It's only a matter of time.
