Future Lawyer: eDiscovery and Generative AI Today – Part 2
Lawyers have been grappling for years with how and when to implement eDiscovery and how to get the most out of it. The reasons why eDiscovery can be a challenging and seemingly expensive part of litigation are varied; however, we can link them back to a few “big picture” issues that eDiscovery practitioners face on every matter. Could generative AI improve the solutions to manage data volumes, legal input, and emerging data sources?
Data Volumes
As organisations move to the cloud, employees work remotely, and daily operations are increasingly carried out electronically, the amount of data created and retained by businesses has grown exponentially. So, when it comes to litigation and the disclosure stages of a proceeding, finding crucial documents and evidence in this vast quantity of data can be a daunting task. What’s more, the costs need to be proportionate to the claim's value: there is no point in spending 200k on a disclosure exercise if your claim is only worth 100k. The more data you have in scope for a matter, the higher the cost to carry out your disclosure exercise.
Legal Input
Effectively locating the documents which need to be disclosed is the key goal of eDiscovery; however, lawyers also need to be familiar with the disclosed documents and understand how they fit into the case timelines and the trial strategy. The more documents disclosed, the more documents the legal team need to be aware of and integrate into chronologies and trial bundles. Legal teams also need to be able to analyse and understand incoming disclosures and build those documents into their case. Furthermore, the role of eDiscovery and eDiscovery technology doesn’t end at the disclosure deadline!
Emerging data sources
Since businesses started moving to email, disclosure transformed from an exercise in reviewing and copying hard copy documents into one of searching and reviewing information created electronically. In the last decade, the disclosure landscape has changed dramatically, with new data sources such as mobile devices, chat apps and collaboration tools, alongside an increasing need to disclose structured data such as financial accounts or proprietary databases. Specialist tools are often needed to search these new kinds of data and understand them in context.
The tools and techniques available to help manage eDiscovery exercises have not only changed significantly over time but have also changed quickly and, in some cases, almost beyond recognition. Despite being directly responsible for some of the complexity of eDiscovery, technology can offer solutions as well – and trying to tackle an eDiscovery exercise without the smart use of technology can end up costing you far more than the technology itself. Improving solutions for data volumes, legal input, and emerging data sources would have a direct impact on speed, accuracy, and cost.
Cue jargon from existing eDiscovery solutions
Technology efforts to optimise eDiscovery begin with automatic deduplication, threading email conversations, and grouping near-duplicate documents. Despite its limitations, keyword searching is a staple of eDiscovery and is often crucial for early identification of key documents. Analytics techniques such as concept clustering can help group documents by theme and topic to cull data sets before review.
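To make the first of those techniques concrete, here is a minimal sketch of hash-based deduplication in Python, assuming documents are plain-text files in a single folder (the folder name is illustrative). Production eDiscovery platforms typically hash native files (often with MD5 or SHA-1) and also account for metadata, families and attachments.

```python
import hashlib
from pathlib import Path

def deduplicate(folder: str) -> dict[str, list[Path]]:
    """Group files by the SHA-256 hash of their raw contents."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(Path(folder).glob("*.txt")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups.setdefault(digest, []).append(path)
    return groups

# Only the first file in each group needs review; the rest are exact copies.
for digest, paths in deduplicate("./documents").items():
    if len(paths) > 1:
        print(f"{len(paths)} identical copies: {[p.name for p in paths]}")
```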
Machine Learning (often referred to in eDiscovery as Technology Assisted Review, or TAR) is also frequently used to help identify relevant documents and prioritise their review – however, this technology wasn’t widely used until courts in several jurisdictions approved it, and arguably it still isn’t used as frequently as it ought to be. TAR has gone through a few iterations, with the most widely used approach today being Active Learning, where a computer model continuously ranks and re-ranks documents by their likelihood of being relevant, as determined by the decisions of human reviewers. Bespoke technology like Sky Photo Search, Sky Smart Extract, and Sky Native Translation can also be used to make sense of different types of data – images, structured documents like invoices, or foreign language data.
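As an illustration, the Active Learning loop described above can be sketched in a few lines of Python with scikit-learn. The corpus, batch size and model choice here are illustrative assumptions; commercial TAR tools use far more sophisticated models, sampling strategies and stopping criteria.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

documents = [
    "Invoice 4411 attached for the Q3 shipment.",
    "Fwd: draft agreement, please review clause 7 before Friday.",
    # ... thousands more documents in a real matter ...
]
labels: dict[int, int] = {}  # doc index -> 1 (relevant) / 0 (not), from reviewers

X = TfidfVectorizer().fit_transform(documents)

while unreviewed := [i for i in range(len(documents)) if i not in labels]:
    if len(set(labels.values())) == 2:  # need both labels before training
        model = LogisticRegression(max_iter=1000)
        model.fit(X[list(labels)], list(labels.values()))
        scores = model.predict_proba(X[unreviewed])[:, 1]
        # Re-rank: surface the documents most likely to be relevant next.
        batch = [i for _, i in sorted(zip(scores, unreviewed), reverse=True)[:10]]
    else:
        batch = unreviewed[:10]  # seed batch, reviewed blind
    for i in batch:
        labels[i] = int(input(f"Reviewer decision for document {i} (1/0): "))
```

Each pass through the loop retrains the model on all human decisions so far, which is what allows the ranking to improve continuously as review progresses.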
Where could AI take us?
As with any advance in technology, there is an impact on the data that organisations store and need to interrogate for disclosure exercises, but also an opportunity to leverage this technology to help manage the eDiscovery process. Speculation on how Large Language Models (LLMs) might be used to speed up or entirely replace document review has been rife, and since the technology is evolving rapidly, its capabilities today might be very different from its capabilities in six months. An LLM could be capable of selecting the most relevant documents for the case before human review or immediate disclosure, discarding all documents which aren’t relevant. An LLM could also help with identifying “smoking guns” or flagging potentially privileged information. This approach could eliminate or drastically reduce the time and effort required to review documents.
[Image omitted. Source: ChatGPT]
There are two technical possibilities for deploying an LLM to improve document review. An external and generically trained LLM could be pointed at the dataset for a case, or an LLM could be trained in-house on only the data for one specific matter. There are issues with both configurations: a generically trained LLM may not be sufficiently specialised for the task and may be unable to fully understand or contextualise large data sets because of limits on how much text the model can accept as input (its context window). Training an LLM in-house may require computing power and expertise far beyond the resources available to most businesses, and the data sets in question may not be large enough to train the model fully.
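To illustrate the first configuration, here is a hedged sketch of pointing a generic LLM at case documents, working around input-size limits by splitting each document into chunks. It assumes the OpenAI Python client with an API key set in the environment; the model name, prompt and 8,000-character chunk budget are illustrative choices, and any output would still need human validation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
CHUNK_CHARS = 8_000  # crude stand-in for a token-based context limit

def is_relevant(document_text: str, case_summary: str) -> bool:
    """Flag a document as relevant if any chunk of it is judged relevant."""
    chunks = [document_text[i:i + CHUNK_CHARS]
              for i in range(0, len(document_text), CHUNK_CHARS)]
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; any chat model would do
            messages=[
                {"role": "system",
                 "content": f"You are screening disclosure documents. "
                            f"Case background: {case_summary}. "
                            f"Answer only RELEVANT or NOT RELEVANT."},
                {"role": "user", "content": chunk},
            ],
        )
        if "NOT" not in response.choices[0].message.content.upper():
            return True
    return False
```

Note that chunking is itself a compromise: a document split into pieces loses cross-chunk context, which is exactly the contextualisation problem described above.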
An even more serious hurdle is the existence of AI “hallucinations”, where the model invents information and presents it as factual. Many are aware of the personal injury case against Colombian airline Avianca in the Southern District of New York, in which attorney Steven Schwartz relied on legal research on similar cases from ChatGPT, research the AI had completely fabricated. Even when asked explicitly whether the cited cases were real, ChatGPT insisted they were, and its assurances of reliability and accuracy duped the attorney into not performing further research or validating its findings.
It’s not impossible to imagine a situation where an LLM analyses a data set for relevant documents and then reports facts or documents which, although plausible and seemingly helpful, are entirely fictional. Without serious due diligence and validation, the output of LLMs may not be reliable enough for legal use. While AI document review might become possible if LLMs evolve to a point where hallucinations can be reduced and better options for model training emerge, other uses might be more realistic in the short to medium term.
Generative AI such as ChatGPT is designed to generate new content, synthesise large volumes of data, and re-present that content in different formats. This could lift a huge weight from lawyers and free up their time for substantive legal analysis and strategy work. Tasks such as document summaries (especially when privilege logs are requested), chronologies, witness bundles and drafts of witness statements, and other work that involves collating, summarising and compiling data could increasingly be managed via AI, as sketched below. Contrast this with the disclosure task of identifying existing relevant content: while document review is a high-profile target for improvement, other tasks may be better suited to immediate improvement with AI.
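As a small example of that kind of generative task, the sketch below drafts a one-sentence privilege-log description of a document. The prompt and model name are assumptions, and in practice a lawyer would review and edit every entry before it left the building.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def privilege_log_entry(document_text: str) -> str:
    """Draft a neutral one-line description for a privilege log."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[
            {"role": "system",
             "content": "Summarise this document in one neutral sentence "
                        "suitable for a privilege log. Do not quote "
                        "privileged content."},
            {"role": "user", "content": document_text[:8_000]},
        ],
    )
    return response.choices[0].message.content.strip()
```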
Many discussions of AI become polarised, with some envisioning a world where AI completely replaces humans in the workforce, while others fear the negative consequences of more immediate scenarios. Realistically, risks and concerns are always being managed: not only the reliability of the technology, but also the ethical and legal questions around how it can be used. Used well, LLMs and generative AI could be a significant advancement, particularly for time-consuming tasks that are better suited to technology than to people.
In Part 3 of this series, we’ll look in more detail at factors beyond technical limitations which need to be accounted for when assessing AI for eDiscovery.
Read Part 3 ›
Back to Part 1 ›