Kirsten Langendorf wanted to help find solutions to the current pandemic, so she focused her mind on a covidgraph. In this article, she shares her experiences and solutions.
The Shutdown
On March 11, 2020, the Danish government decided to shut down many activities due to the COVID-19 pandemic, and I now had to work from home. The result was that I didn’t do much besides working. We were advised to limit social activities, which is very hard when you are used to doing so much collaborative work in the office. A letter was sent out to all Danish citizens asking everybody with a health education background to volunteer for the COVID-19 preparedness team. Well, with math and computer science in my educational bag, I could not help out. I work with software development and linked data. Then, at the end of April, I received one of those news emails from Neo4j linking to covidgraph.org.
Covidgraph.org Aims
The covidgraph.org aims, “to help researchers quickly and efficiently find their way through COVID-19 datasets and to provide tools that use artificial intelligence, advanced visualization techniques, and intuitive user interfaces. This allows to explore papers, patents, existing treatments and medications around the family of the corona viruses.”
This was exactly the opportunity I was looking for: they needed volunteers, so I signed up. They wanted to add data from ClinicalTrials.gov. With my experience with clinical trial data from the pharmaceutical industry, I could combine that knowledge with my linked data and computer skills and in this way contribute to improving the knowledge of this new disease. I could help defeat COVID-19 in my own nerdy way and learn something at the same time.
So, if you are not into nerdy Cypher stuff then skip the next section and go to the end.
Being blessed with access to the internet and all the tips and examples shared out there, I managed to write the Cypher queries needed to load the clinical data into the covidgraph. So, I thought it would be in the spirit of our new world order to share what I did and, in this way, also share my tips and examples.
The Source Data – ClinicalTrials.gov
Clinical trials that are conducted under an investigational new drug application must be registered in this database, including both the clinical protocol information and the study results. The data is publicly available and can be downloaded using their API: https://clinicaltrials.gov/api/gui/demo/simple_study_fields.
You can get the data in XML, JSON, or CSV format, but only up to 1000 records per request. For the Cypher I used the JSON format.
After a bit of reading on the syntax and the study fields documentation, I decided to split the JSON files into three types of studies:
- Observational
- Interventional
- All other – not the above
Due to the limit of 1000 records per request, I needed to loop in chunks of 1000. So getting the JSON files looked like this:
- (COVID and Interventional studies, study fields: NCTId,OrgStudyId,BriefTitle,Acronym,OfficialTitle,StudyType):
call apoc.load.json('https://clinicaltrials.gov/api/query/study_fields?expr=COVID+AND+AREA%5BStudyType%5DInterventional&fields=NCTId&fmt=json&max_rnk=1000') yield value with value.StudyFieldsResponse.NStudiesFound as NStudies, RANGE(0,(value.StudyFieldsResponse.NStudiesFound/1000)) as nloop UNWIND nloop as i with range(1+1000*i,1000+1000*i,999) as RANGES with RANGES, RANGES[1] as urange, RANGES[0] as lrange call apoc.load.json('https://clinicaltrials.gov/api/query/study_fields?expr=COVID+AND+AREA%5BStudyType%5DInterventional&fields=NCTId,OrgStudyId,BriefTitle,Acronym,OfficialTitle,StudyType&min_rnk='+lrange+'&max_rnk='+urange+'&fmt=json') yield value with value.StudyFieldsResponse.StudyFields as coll unwind coll as study_metadata
I used similar queries for the other types of studies and for different study fields.
- The json returned looks like this (one record):
{ "NCTId": [ "NCT04384588" ], "OrgStudyId": [ "FALP 001-2020" ], "StudyType": [ "Interventional" ], "BriefTitle": [ "COVID19-Convalescent Plasma for Treating Patients With Active Symptomatic COVID 19 Infection (FALP-COVID)" ], "Rank": 1, "Acronym": [ "FALP-COVID" ], "OfficialTitle": [ "Investigational- Compassionate Use of Convalescent Plasma From COVID-19 Donors in Oncological and Non-Oncological Patients With Severity Criteria: FALP 001-2020 Trial (FALP-COVID)" ] }
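The chunking arithmetic in the query above can be sketched outside Cypher. The following Python is an illustration only, not the code used for the covidgraph: the helper names `rank_windows` and `page_url` are hypothetical, and the URL and parameter names are simply the ones that appear in the query above. Unlike the Cypher `RANGE(0, NStudiesFound/1000)`, this sketch does not emit a trailing empty window when the study count is an exact multiple of 1000.

```python
from urllib.parse import urlencode

# Endpoint and parameter names copied from the Cypher query in the article.
BASE_URL = "https://clinicaltrials.gov/api/query/study_fields"
PAGE_SIZE = 1000  # maximum number of records returned per request


def rank_windows(n_studies, page_size=PAGE_SIZE):
    """Return (min_rnk, max_rnk) pairs that together cover ranks 1..n_studies."""
    windows = []
    for lower in range(1, n_studies + 1, page_size):
        windows.append((lower, lower + page_size - 1))
    return windows


def page_url(expr, fields, lower, upper):
    """Build the request URL for one rank window (hypothetical helper)."""
    query = urlencode({
        "expr": expr,
        "fields": ",".join(fields),
        "min_rnk": lower,
        "max_rnk": upper,
        "fmt": "json",
    })
    return f"{BASE_URL}?{query}"


# Example: 2500 studies found means three requests.
windows = rank_windows(2500)
urls = [
    page_url("COVID AND AREA[StudyType]Interventional", ["NCTId", "BriefTitle"], lo, hi)
    for lo, hi in windows
]
```

Each URL in `urls` corresponds to one iteration of the `UNWIND nloop as i` loop in the Cypher version.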
Getting the source data linked
Then came the trickier bit of deciding how to model the study data in a graph. At first, I put a lot of the fields as properties on the main node, which I called ClinicalTrial and identified by the NCTId field. However, this is not such a good approach when trying to match with the covidgraph ‘ecosystem’. So, I developed a new version that turns most fields into nodes of their own, most of them related to the ClinicalTrial node.
The JSON returned every field as a list. Storing a list as a property is also bad for querying, so the data needed to be UNWINDed:
UNWIND study_metadata.NCTId as Id UNWIND study_metadata.StudyType as StudyType merge (ct:ClinicalTrial{NCTId:Id,data_source:'clinicaltrials.gov',url:'https://clinicaltrials.gov/ct2/show/' + Id}) MERGE (st:StudyType{type:StudyType}) MERGE(ct)-[:IS_TYPE]->(st) WITH Id, ct, study_metadata UNWIND study_metadata.OrgStudyId as OrgStudyId UNWIND study_metadata.Acronym as Acronym merge (si:StudyIdentification{studyId:OrgStudyId, acronym:Acronym}) MERGE(ct)-[:HAS_IDENTIFICATION]->(si) WITH Id, si, study_metadata UNWIND study_metadata.BriefTitle as BriefTitle UNWIND study_metadata.OfficialTitle as OfficialTitle MERGE (t:Title{briefTitle:BriefTitle,officialTitle:OfficialTitle}) MERGE (si)-[:HAS_TITLE]->(t)
This bit of the graph then looked like this for NCTId = NCT04384588:
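To make the effect of the chained UNWINDs in the query above more concrete, here is a small Python sketch. It is an illustration under assumptions, not part of the covidgraph loader: the `unwind` helper is hypothetical and only mimics how several UNWIND clauses in a row produce one row per combination of list values. The record values are taken from the sample JSON shown earlier.

```python
from itertools import product


def unwind(record, *fields):
    """Mimic chained UNWINDs: one row per combination of the list values.

    As with UNWIND in Cypher, an empty or missing list yields no rows at all.
    """
    lists = [record.get(field, []) for field in fields]
    return [dict(zip(fields, combo)) for combo in product(*lists)]


record = {
    "NCTId": ["NCT04384588"],
    "OrgStudyId": ["FALP 001-2020"],
    "StudyType": ["Interventional"],
    "Acronym": ["FALP-COVID"],
}

# Every list holds exactly one value, so exactly one row comes back.
rows = unwind(record, "NCTId", "OrgStudyId", "Acronym")

# One (start, relationship, end) tuple per row, mirroring HAS_IDENTIFICATION.
edges = [
    (row["NCTId"], "HAS_IDENTIFICATION", (row["OrgStudyId"], row["Acronym"]))
    for row in rows
]
```

One consequence worth keeping in mind: if a field such as `Acronym` came back as an empty list, the combined UNWIND would yield no rows, so no StudyIdentification node would be created for that trial. Whether the API returns empty lists for missing values is an assumption here.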
The full graph has this schema:
The nodes and their properties:
Node | Properties |
---|---|
ClinicalTrial | [data_source,NCTId,url] |
StudyIdentification | [acronym,studyId] |
Title | [officialTitle,briefTitle] |
Status | [status] |
StopReason | [reason] |
Start | [date] |
Completed | [primaryCompletionDate,completionDate] |
Responsible | [type] |
Investigator | [name,affiliation] |
Sponsor | [name] |
Collaborator | [name] |
Response | [YN] |
Description | [summary,detailed] |
Condition | [disease] |
Keyword | [word] |
InclusionCriteria | [criteria] |
Design | [model,name,description] |
ObservationPeriod | [time] |
BioSpecimen | [description,retension] |
Arm | [name,description] |
Intervention | [type,name,description] |
Outcome | [name,description,type,time] |
StudyPopulation | [name,sampling] |
Gender | [name,description] |
AgeRange | [maxAge,minAge] |
ExclusionCriteria | [criteria] |
Contact | [name,email] |
Facility | [name] |
City | [name] |
Country | [name] |
PaperId | [id,type] |
Citation | [name] |
ReferenceType | [name] |
Link | [url] |
StudyType | [type] |
Purpose | [name] |
Phase | [phase] |
The inclusion and exclusion criteria were returned in the json as a combined Eligibility Criteria field:
{ "NCTId": [ "NCT04384588" ], "Rank": 1, "EligibilityCriteria": [ "Inclusion Criteria: For all patients: A. Patient must sign an informed consent to participate in this trial
B. Signed consent to participate in this trial must be given not after 14 days from the first day of symptoms COVID-19 related
Patients with severity criteria must have any of the following: dyspnea and or respiratory rate >=30 per min and or saturation <= 2 50 93% with fraction of inspired oxygen 21% and or ratio partial pressure arterial (pafi lung images showing worsening in 24-48 hours patients without severity criteria but more factor risks: a. years older b. any the following comorbidities: diabetes mellitus, hypertension, chronic obstructive pulmonary disease, kidney failure, non-oncological related immunosuppression c. total bilirubin>1,2 mg/dl or Blood Urea Nitrogen> 20 mg/dl or Lactate Dehydrogenase>245 U/L D. D-dimer > 1mg/L E. Neutrophils 7.3 x 10³ or greater and or Lymphocytes lesser than 0,8 x 10³ µl F. C reactive protein >9,5 mg/dl and ferritin > 300 ug/ml G. Interleukin-6 >7 pg/mL H. antineoplastic treatment such as radiotherapy- cytotoxic chemotherapy- immunotherapy- molecular therapy- oncological surgery during the last 8 weeks
Exclusion Criteria: known allergy to plasma Severe multiple organic failure Active intra brain hemorrhage Disseminated intravascular coagulation with blood products requirements Patient with an adult respiratory distress longer than 10 days patients with active cancer and life expectancy shorter than 12 months according with medical criteria" ] }
The query then needed to split this text and store each inclusion and exclusion criterion as an individual node:
UNWIND study_metadata.NCTId as Id match(ct:ClinicalTrial{NCTId:Id}) with ct, study_metadata UNWIND study_metadata.EligibilityCriteria as EligibilityCriteria with study_metadata, ct, split(replace(replace(trim(substring(EligibilityCriteria,size(split(EligibilityCriteria,"Exclusion")[0])+19,size(EligibilityCriteria))),'\n','#'),'##','#'),'#') as Exclusion, split(replace(replace(trim(substring(EligibilityCriteria,19,size(split(EligibilityCriteria,"Exclusion")[0])-19)),'\n','#'),'##','#'),'#') as Inclusion with study_metadata, ct, Inclusion, Exclusion, RANGE(0,size(Inclusion)-1) as nincl FOREACH(i in nincl | MERGE(incl:InclusionCriteria{criteria:Inclusion[i]}) MERGE(ct)-[:HAS_INCLUSION_CRITERIA]->(incl)) with study_metadata, ct, Inclusion, Exclusion, RANGE(0,size(Exclusion)-1) as nexcl FOREACH(i in nexcl | MERGE(excl:ExclusionCriteria{criteria:Exclusion[i]}) MERGE(ct)-[:HAS_EXCLUSION_CRITERIA]->(excl))
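The string handling in that query is dense, so here is the same idea as a minimal Python sketch. It is not the loader's code: `split_criteria` is a hypothetical helper, and the sample text is a made-up, shortened record that only assumes the "Inclusion Criteria:" and "Exclusion Criteria:" markers and one criterion per line. It also differs slightly from the Cypher, which splits on the first occurrence of the word "Exclusion" and collapses doubled line breaks only once.

```python
def split_criteria(text):
    """Split a combined eligibility text into inclusion and exclusion lists."""
    # Everything before the marker is inclusion, everything after is exclusion.
    inclusion_part, _, exclusion_part = text.partition("Exclusion Criteria:")
    inclusion_part = inclusion_part.replace("Inclusion Criteria:", "", 1)

    def to_items(part):
        # One criterion per non-empty line, surrounding whitespace removed.
        return [line.strip() for line in part.splitlines() if line.strip()]

    return to_items(inclusion_part), to_items(exclusion_part)


# Hypothetical, shortened sample in the shape of the EligibilityCriteria field.
sample = (
    "Inclusion Criteria:\n\n"
    "Age 18 or older\n"
    "Signed informed consent\n\n"
    "Exclusion Criteria:\n\n"
    "Known allergy to plasma\n"
)
inclusion, exclusion = split_criteria(sample)
```

Each entry in `inclusion` and `exclusion` corresponds to one InclusionCriteria or ExclusionCriteria node in the graph. A text without an exclusion marker simply yields an empty exclusion list.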
The json for the country, city, and facility looked like this:
{ "NCTId": [ "NCT04366271" ], "LocationFacility": [ "Hospital Universitario de Getafe", "Hospital Universitario de Cruces", "Hospital Universitario de La Princesa", "Hospital Infantil Universitario Niño Jesus", "Hospital Ramón Y Cajal", "Complejo Universitario La Paz" ], "Rank": 2, "LocationCity": [ "Getafe", "Barakaldo", "Madrid", "Madrid", "Madrid", "Madrid" ], "LocationState": [ "Madrid" ], "LocationCountry": [ "Spain", "Spain", "Spain", "Spain", "Spain", "Spain" ] }
It was evident that these were to be paired into 6 records. The query for this was:
UNWIND study_metadata.NCTId as Id match(ct:ClinicalTrial{NCTId:Id}) WITH Id, ct, study_metadata, RANGE(0,size(study_metadata.LocationFacility)-1) as nfacil FOREACH(i in nfacil | MERGE(fa:Facility{name:study_metadata.LocationFacility[i]}) MERGE(ci:City{name:study_metadata.LocationCity[i]}) MERGE(c:Country{name:study_metadata.LocationCountry[i]}) MERGE(ct)-[:CONDUCTED_AT]->(fa) MERGE(fa)-[:LOCATED_IN]->(ci) ) WITH Id, study_metadata, RANGE(0,size(study_metadata.LocationCity)-1) as ncity FOREACH(i in ncity | MERGE(ci:City{name:study_metadata.LocationCity[i]}) MERGE(c:Country{name:study_metadata.LocationCountry[i]}) MERGE(ci)-[:LOCATED_IN]->(c) )
Which resulted in this graph:
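The positional pairing done by the location query can also be written down in a few lines of Python. This is a sketch under assumptions rather than the actual loader: `location_edges` is a hypothetical helper, a set stands in for the de-duplicating effect of MERGE, and `zip` stops at the shortest list, whereas the Cypher indexes all three lists by the number of facilities.

```python
def location_edges(record):
    """Pair facility, city, and country by position and return unique edges."""
    nct_id = record["NCTId"][0]
    facilities = record.get("LocationFacility", [])
    cities = record.get("LocationCity", [])
    countries = record.get("LocationCountry", [])

    edges = set()  # a set drops duplicates, much like repeated MERGE calls
    for facility, city, country in zip(facilities, cities, countries):
        edges.add((nct_id, "CONDUCTED_AT", facility))
        edges.add((facility, "LOCATED_IN", city))
        edges.add((city, "LOCATED_IN", country))
    return edges


# The sample record from the article (LocationState is ignored, as in the query).
record = {
    "NCTId": ["NCT04366271"],
    "LocationFacility": [
        "Hospital Universitario de Getafe",
        "Hospital Universitario de Cruces",
        "Hospital Universitario de La Princesa",
        "Hospital Infantil Universitario Niño Jesus",
        "Hospital Ramón Y Cajal",
        "Complejo Universitario La Paz",
    ],
    "LocationCity": ["Getafe", "Barakaldo", "Madrid", "Madrid", "Madrid", "Madrid"],
    "LocationState": ["Madrid"],
    "LocationCountry": ["Spain", "Spain", "Spain", "Spain", "Spain", "Spain"],
}
edges = location_edges(record)
```

For this record the six facilities collapse onto three distinct cities and one country. Note that `LocationState` has a single entry while the other lists have six, which is why it cannot be paired by position in the same way.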
Covidgraph Learnings
Well, enough of the nerdy stuff. So, what have I learned from volunteering for the covidgraph.org project?
I took the Neo4j Professional certification about a year ago but hadn’t had a chance to use it to a great extent. With this task I learned much more about Cypher querying and reading JSON files.
I am sure that the model I have made could be better, and feedback is much appreciated. However, sometimes data quality prevents the right model from being built. For example, I wanted to link the Interventions to the Arms, but the data was incomplete and didn’t allow this to be done programmatically. I will continue to improve and extend the model. The next step will be clinical trial results.
The team behind covidgraph.org is a very efficient virtual team and always there to help and guide you. Being part of that team and adding a piece of information to the bigger picture is a meaningful task for me: I can help! I hope that adding clinical trial data to the graph will provide knowledge for the researchers using the graph.
I think having more teams like covidgraph.org could help us understand complex matters by putting data from various sources into a meaningful graph. Specifically, for the pharmaceutical industry, a graph of pharmaceutical products could be made, providing an intuitive overview of drug classes, their effects, and their side effects. A first step could be to create a clinical trials graph using the ClinicalTrials.gov data. Perhaps this could be a working group within Phuse.eu?
Want to know more?
If you want to know more about the covidgraph, then do get in touch with Kirsten via our contact page and, if you want to understand how we use graphs to develop our software, sign up for our A3 Community MDR or get in touch with the A3 Team.