Unlocking Health Innovation with Synthetic Data
MARK LAMBRECHT: We have to be careful, right? First of all, the real-world data that is used to train or generate that synthetic data should be of high-enough quality and should represent the statistical signals that you're interested in. Otherwise, you'll end up with a synthetic data set that might put you on the wrong foot.
ALEX MAIERSPERGER: Harry Keen, founder of synthetic data pioneer Hazy, and Dr. Mark Lambrecht, SAS's global head of health and life sciences, share how they saw the AI wave 10 years early, and what health leaders must do now to stay ahead in the decade to come.
[THEME MUSIC]
Harry, you founded a synthetic data company that was ultimately acquired. What opportunity did you see early on that perhaps others didn't?
HARRY KEEN: Yeah, so we started what was originally called Hazy back in 2017 when we spotted this challenge around sharing data quickly and easily. We saw it in a very small use case, initially. We were working at another startup, and we were trying to work with remote software developers. We wanted to give them access to our customer database, but couldn't find a quick and easy way to anonymize it, so we could share it with them without thinking twice, if you know what I mean.
So we set out to solve that challenge. We were using some of the early kind of named entity recognition technologies to try and figure out where personal data was and then doing simple sort of redaction techniques. But we quite quickly realized that there's a much bigger opportunity in the sort of world of AI and data science. So imagine if you could produce these safe data sets that you could ultimately feed into the growing world of data-hungry applications around AI.
So we realized quite quickly that the kind of concept of just kind of redacting data was going to destroy too much information. About the same time, this whole class of generative algorithms was coming out as well, where you could ultimately kind of replicate the statistical quality of the real data that you're working with, without compromising the kind of privacy and the sort of identity of the people in that real data set. So it became a sort of marrying of a really new kind of cutting-edge technology with a kind of old and existing problem, which was sort of a growing problem, as it were.
ALEX MAIERSPERGER: You helped us a little bit. Can you dive in a little bit deeper to help me understand what synthetic data is and where it plays in that? I think there are some big headlines that we've seen. One is about that anonymized data. That there was a country that used it for some of the health information, and it got sort of re-anonymized or de-anonymized. It was tracked back to the real data very quickly once it was out in the public. And so there was some fear on that side of it.
And then there's some fear, maybe because of that, but there's been some headlines around of backlash for fake data of, we already have mountains of real data-- we don't need this data. Where does synthetic data fit into those two extremes?
HARRY KEEN: Yeah, absolutely. So firstly synthetic data, the way it works is you take a real data set. You train a generative algorithm against that source data to understand all of the statistical patterns, correlations, distributions within those data. And then that model effectively captures that information. And you can ask that model to generate, row by row, totally synthetic, totally made-up data records that individually wouldn't exist in real life. They'll just look realistic because the model has learned what a realistic data record should look like.
And then you can keep on generating these records, and you end up with a fully synthetic data set that, critically, doesn't contain any of the real information but does preserve the statistical characteristics. Now, within that process, we also apply a privacy technique called differential privacy, which provides a sort of mathematical guarantee around the level of privacy that you're achieving.
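[EDITOR'S ILLUSTRATION] The train-then-generate loop Harry describes can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: the "real" records are invented here, the columns (age, systolic blood pressure) are hypothetical, and a two-column Gaussian stands in for the far richer generative models a product would use. It does not apply differential privacy and is not Hazy's or SAS's method.

```python
import math
import random

rng = random.Random(42)

# Made-up "real" records for illustration only: (age, systolic blood pressure).
real_rows = []
for _ in range(2000):
    age = rng.gauss(50, 12)
    systolic = 90 + 0.6 * age + rng.gauss(0, 8)
    real_rows.append((age, systolic))


def fit(rows):
    """Learn the means, spreads, and correlation of two numeric columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y),
            "corr": cov / (std_x * std_y)}


def generate(model, n_rows, rng):
    """Sample brand-new rows from the fitted model, one record at a time."""
    (mean_x, mean_y), (std_x, std_y) = model["mean"], model["std"]
    corr = model["corr"]
    residual = math.sqrt(max(0.0, 1.0 - corr * corr))
    rows = []
    for _ in range(n_rows):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the learned correlation between columns.
        y = mean_y + std_y * (corr * z1 + residual * z2)
        rows.append((x, y))
    return rows


model = fit(real_rows)                       # "train" on the source data
synthetic_rows = generate(model, 5000, rng)  # generate made-up records
synthetic_model = fit(synthetic_rows)        # re-measure the statistics

print("real corr:     ", round(model["corr"], 3))
print("synthetic corr:", round(synthetic_model["corr"], 3))
```

Comparing the two printed correlations is a crude utility check: the synthetic rows are new draws, yet the relationship between the columns should be close to the one in the source data.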
So to come back to your point there, where does synthetic data sit on a spectrum? There are very, very valid concerns that have been around for a while with anything to do with anonymization. You're always trading off between privacy on the one hand and utility on the other. So if you dial privacy right up, you just delete the data set. If you dial it right down again, you would have a totally raw data set, but it's totally unprivate.
So somewhere in the middle, you've got a trade-off between the two. So synthetic data, with differential privacy, is the most efficient way to make that trade-off. So you could end up with a useful data set for tons of use cases whilst preserving the privacy of the individuals within that. So it's kind of a complex answer really to quite a simple question. But synthetic data has its uses, and you've got to understand its limitations as well, depending on which use case you're going to apply it to.
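[EDITOR'S ILLUSTRATION] The privacy-versus-utility dial can be made concrete with the textbook Laplace mechanism, where a privacy parameter epsilon sets how much noise is added to a released count. This is a sketch under assumptions: the count of 130 and the epsilon values are invented, and a single noisy count is a much simpler setting than applying differential privacy while training a generative model, which the conversation does not detail. The plain floating-point sampling here is for illustration, not production use.

```python
import random

rng = random.Random(0)


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the Laplace noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


true_count = 130  # hypothetical number of patients matching some query

# Smaller epsilon = stronger privacy = noisier, less useful answers.
mean_abs_error = {}
for epsilon in (0.1, 1.0, 10.0):
    errors = [abs(private_count(true_count, epsilon, rng) - true_count)
              for _ in range(5000)]
    mean_abs_error[epsilon] = sum(errors) / len(errors)

for epsilon, error in mean_abs_error.items():
    print(f"epsilon={epsilon}: average error of about {error:.2f}")
```

Dialing epsilon down pushes the average error up, which is the same trade-off Harry describes: at one extreme the answer is too noisy to use, at the other it is exact but offers little protection.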
ALEX MAIERSPERGER: Mark, let's talk about some of those uses in the health care and life sciences realm. Has synthetic data become an area of interest in these organizations that you're working with? And maybe on a scale of 1 to 10, we'll give you this scale as well. So 10 being most interested, where are we in the either interest or adoption for synthetic data?
MARK LAMBRECHT: I think we could say, Alex, that we're at a 4 or 5. Yes, there is interest with all of the executives and decision makers and users that we're talking to about synthetic data, but it's really only starting. And of course, health care and life sciences needs to deal, as Harry has explained, with a lot of things here, not least ensuring that patient information cannot be re-identified. And so there is interest.
But out of a survey that we did of almost 500 decision makers in the industry, we see that more than 75% of those decision makers are really busy in terms of AI with, How can I guarantee security? And that's partially where synthetic data comes in. We see that regulators are talking about it: FDA, European Medicines Agency for pharmaceutical companies. Healthcare companies are using it because there's a big promise of using so-called electronic health records and combining that with other types of data.
But I think synthetic data now has some narrow-scoped applications, such as the ability to avoid bias, to test out certain algorithms, AI algorithms, to kind of dry-run them before you actually apply them in real life. It's also being used for a number of other areas too. For example, if you have underrepresented populations, like in rare diseases, you can really increase the data from that underrepresented population, ensuring that you will detect the actual signal once you apply it to a real-life situation.
So apart from these narrow-scoped applications, there's definitely a lot of interest. It's not new. Even in academia, synthetic data has been looked at for many purposes. But what you're really seeing is that industry is now jumping on the bandwagon as well.
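[EDITOR'S ILLUSTRATION] The rare-disease example Mark gives, increasing the data from an underrepresented population, can be sketched as fitting a simple model to the small group and sampling extra records from it. Everything here is an assumption for illustration: the biomarker column, the group sizes, and the one-dimensional Gaussian model are all made up.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical, made-up records: (biomarker_value, has_rare_disease).
common = [(rng.gauss(1.0, 0.3), 0) for _ in range(950)]
rare = [(rng.gauss(2.0, 0.4), 1) for _ in range(50)]


def augment(group, target_size, rng):
    """Top up a small group with synthetic records drawn from a fitted Gaussian."""
    values = [value for value, _ in group]
    label = group[0][1]
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    extra = [(rng.gauss(mu, sigma), label)
             for _ in range(target_size - len(group))]
    return group + extra


rare_augmented = augment(rare, len(common), rng)
balanced = common + rare_augmented

print("rare records before:", len(rare))
print("rare records after: ", len(rare_augmented))
```

The added records carry no information beyond what the 50 real ones contain, so the quality of the original sample still bounds what a downstream model can learn.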
ALEX MAIERSPERGER: Really exciting and sounds like very early days still. So some adoption, but quite a bit of interest. Harry, at the same time, there's industries that are further ahead on the both interest and adoption curve in their use of synthetic data than health care and life sciences, certainly. I think in the press release, there's financial services, insurance, telecoms. They've been early movers. What lessons can health care, can life sciences learn from those industries that are already further along that adoption curve?
HARRY KEEN: Yeah, I think it's really important to-- and this is a kind of a learning curve that all organizations go on-- when you describe the benefits of synthetic data, it sounds like a magic bullet to every single potential problem. That's just not the case. There's a set of use cases and applications where it's really, really good and others where it won't be applicable. So you need to think really carefully about the utility requirements of the use case. Does this data need to be hyper-representative, or, because this is an experimentation use case, can we live with a 2% reduction in accuracy of the models that it's going to produce, because it's ultimately just a testing exercise?
What are the privacy requirements? Are we sharing this externally? Is this internal? So you really weigh these things up to understand where the synthetic data is actually going to be appropriate for the particular use case. That just takes a little bit of experience and certainly a bit of expertise, types of expertise that SAS can provide to assess these initial use cases. So just being able to save a huge amount of time in these applications and projects with a discount of that data as a way to speed up the process or including it at an early stage.
That also, if you get these decisions right early, you save, I think, a whole series of explanations and ultimately build trust that this can be a useful technology for the specific problems you're applying it to. So it's all about that process, just understanding the scope and limitations of the technology and then applying that in a sensible way.
ALEX MAIERSPERGER: Mark, what are the limitations health care and life sciences-wise? Where should synthetic data-- we've talked a little bit about where it can be used, rare disease and some of the other opportunities. Where should it not be used in health care?
MARK LAMBRECHT: Yeah, I think, well, first of all, we're very happy to have that Hazy IP available to us and to be able to apply it to health care and life sciences. But we have to be careful. First of all, the real-world data that is used to train or generate that synthetic data should be of high-enough quality and should represent the statistical signals that you're interested in. Otherwise, you'll end up with a synthetic data set that might put you on the wrong foot.
I don't think we need to replace it to really look at clinical signals. There's hesitation from medical doctors or clinicians that synthetic data would actually be used to derive real insights. That's not the goal. The goal here is to ensure that we can dry-run some of the processes, that we can avoid some of the bias that potentially might creep in and things like that.
However, it's not really meant to replace the actual clinical data as such. Otherwise, we would not be doing clinical trials that cost hundreds of millions of dollars, of course. In addition, there are factions, rightfully so, in the clinical ecosystem that don't believe that even the real-world data could replace clinical trials, let alone synthetic data replacing those clinical data sets. Having said that, it's complementary, I think. We need to install, as Harry rightfully explained, a trust-building exercise to show where it can build benefit, where it can complement certain techniques, and where it can fit in an end-to-end digital capability, where you can apply analytics and broad sets of analytics on that type of data, both in health care and hospitals, to look at predicting whether patients will be highly susceptible to certain diseases or sepsis, or have to come into the hospital again, or in pharma, looking for responders to new therapies.
So it doesn't really diminish all the other things you need to do. It just helps you to be more prepared for when the moment comes. I'd really compare the ability to generate synthetic data to training before a big sports match. You won't start running a marathon without having trained well. You shouldn't start the clinical trial that costs hundreds of millions of dollars and implicates many patients and affects patients' lives as they sacrifice really their data, their lives to joining such an experiment. You shouldn't do that without having tested out all of the eventualities that would happen during that real-life exercise as well. So that's where we really have to be careful and where we really have to watch out for overuse of synthetic data. It's definitely not a magic wand, but it is needed and helpful.
ALEX MAIERSPERGER: I won't be running a marathon even after training. So I think your example is spot on about that it would be less painful if I did, I guess. I'm so excited to talk to both of you about the future. It seems, obviously, you created and you saw this need and this opportunity before others did. We're very early on, and so I can't wait to hear what you think is coming next and what the next opportunities are. But before we get there, we're going to go into a speed round, so to get to know both of you a little bit. The first one, I think, speaks about what kind of data you consume. So Mark, what is your favorite app on your phone?
MARK LAMBRECHT: Well, I'll take a bit of liberty here, Alex, and take two. One is Substack. I love to go to Substack to read very high quality information. The other is Overcast. I find myself constantly opening Overcast. That's a podcast app, by the way, a very good one. So apart from this podcast, I listen to a lot of them. And so they're a bit part of my habit, part of what I do.
ALEX MAIERSPERGER: Substack and Overcast. Harry, how about you?
HARRY KEEN: Oh, for me, my favorite app is actually AllTrails. I'm pretty sure you get that out in the US as well, but it's brilliant. I love technology that encourages more sort of outdoor activities. It's incredible, the sort of community-driven, I guess, contributions to all of these different trail walks you can do around wherever you are. It's amazing the coverage they've got. So yeah, I would highly recommend it. It's a great way to get out and about.
ALEX MAIERSPERGER: AllTrails. So you may have already answered the next one, your favorite way to exercise.
HARRY KEEN: [LAUGHS] Well, so I live in Lisbon now. Yeah, there's some great hiking and trails around here up in a beautiful place called Sintra. But this area of Portugal is really known for its surfing and its water sports as well. So in between work and family, I try to spend as much time out on the water as well.
ALEX MAIERSPERGER: Did you hear that, Mark? I think there was a quiet moment there where we were invited to a surf trip in Portugal.
MARK LAMBRECHT: I'm all for it. I'm ready. Just say when.
ALEX MAIERSPERGER: Your favorite way to exercise, Mark?
MARK LAMBRECHT: I'm an avid badminton player, so I love that. So I do it a couple of times a week. And this is where all of the pressure from the hard work gets out.
ALEX MAIERSPERGER: All right. Badminton is on as well. Are you a morning or a night person?
MARK LAMBRECHT: I am definitely-- everyone who has worked with me in morning and early mornings knows that I'm a night person. I wake up at night and I love it when everybody else has gone to bed, and I can still work with the quiet of the night.
ALEX MAIERSPERGER: Harry, is there an entrepreneur story in here of all-nighters or are you a morning person?
HARRY KEEN: There definitely used to be. And then I used to think I was a morning person. But then we had our daughter and suddenly, the mornings became a lot more difficult. So yeah, I think I'm leaning more towards Mark's preference now as well.
ALEX MAIERSPERGER: All right. One place on your travel bucket list.
HARRY KEEN: For me, well, I'd love to go to the Mentawais actually in Indonesia again to get-- they have world-class surfing out there. So it's a trip of a lifetime. I will do at some point when time allows.
ALEX MAIERSPERGER: I love that. I love the will, will do. Mark, how about you?
MARK LAMBRECHT: Tough choice. But for me, it would be cherry blossom time in Japan. I'm a bit of an amateur photographer, and I would love to be there. But the problem with that is you need to be there at a very specific moment, that is when nature calls. They open up, and then you have to be there. That's kind of tough to plan if you want to go to Japan. But I want to go partially for the beautiful culture they have there, but also maybe to see the digital health ecosystem and learn a little bit more about that because they have some things figured out pretty well.
ALEX MAIERSPERGER: Absolutely. All right, this is a guest favorite and a listener and viewer favorite. So we've got some avid fans that are dying to know the answer here of your favorite flavor of ice cream, Mark.
MARK LAMBRECHT: Well, I'm Belgian. We have a sort of ice cream, vanilla ice cream. It's called dame blanche, white lady in French, and it's vanilla ice cream topped with melted, of course, Belgian hot, warm dark chocolate. So that combination always gets me going.
ALEX MAIERSPERGER: All right, that might be an unfair advantage. There's a reason why Belgian chocolate, all over the world, is on the tip of your tongue. How about you, Harry?
HARRY KEEN: Ooh, I think it would have to be hazelnut, actually. It's that kind of Ferrero Rocher flavor. I just find it absolutely delicious and can have multiple scoops in one helping. So there's this sort of thing where once you get started, you can't stop. So it's something I try not to start too often.
ALEX MAIERSPERGER: Ice cream and AllTrails. I think it matches you balancing it out.
HARRY KEEN: Exactly.
ALEX MAIERSPERGER: Harry, we had a guest, I believe she was a hospital leader. And she had said, we don't need more data. We just need to better use the data we have. Do you come up against this sort of feeling, that hospital health plan, life science leaders, or could be financial services? Do industry leaders tell you that? How do you overcome that objection? Can you say synthetic data helps with the data you do have, and here's what-- or is it helping them prepare for the future?
HARRY KEEN: Yeah, it's funny. I think you could see that as an objection. Quite often, the problem with trying to use the data they do have is that it's locked behind consents and controls, and all of these barriers that make it very difficult to use the data that they do have. So actually, that challenge can be slightly flipped. And it can be an enabler, ultimately, to go and use the data you do have, because ultimately, synthetic data is the data you do have, particularly if you're using it in a privacy-preserving way.
So it's an unblocker. That should really be how it's seen. Of course, that doesn't touch on the potential additive use cases that Mark's mentioned around augmenting data sets and balancing data, et cetera. But fundamentally, unlocking the data you do have can be achieved with synthetic data. So this is a way to enable that viewpoint, actually.
ALEX MAIERSPERGER: I love that. That's CEO, co-founder, and chief salesman for synthetic data. That's really helpful. Mark, if you were advising a hospital CEO or a pharma R&D leader, where would you tell them to start experimenting with synthetic data?
MARK LAMBRECHT: Yeah, that's a good question. I think it depends a bit, indeed, on what seat you're sitting in. If you're a CIO of a hospital or a hospital health system, you're likely to already have quite good quality of healthcare data available. Nevertheless, in health care and life sciences, in the many years I've been walking around in this industry, indeed, there is enough data. The problem is it's fragmented. It's in different formats. It's owned by different stakeholders that are not necessarily ready to give it up.
So I think what I would be advising is to use synthetic data as a way to overcome some of those challenges because we cannot stop. We cannot keep the patients waiting for therapies, or we cannot wait to provide them a better personalized journey into your hospital. So synthetic data is a way to test out your systems. It's kind of oil in the engine. It might not replace the actual gas that you need to keep the engine running, but it will definitely improve the efficiency of what you're doing.
So that paradigm helps, and it's not a standalone thing. I think I would also advise to embed it in their AI governance infrastructure, make it trustworthy, make it reliable, make sure that you've got your end-to-end data chain of integrity in place as well, so that you can show to the patients how you came to certain conclusions, and to your physicians as well. Also make it apparent where you use synthetic data and where not, so there is no distrust and nobody is harmed in the process.
So it's a matter of trying it out specifically. If you talk about very specific topics, clearly, in clinical trials, for example, there is a lot of modeling and simulation going on. A lot of decisions have to be made that are high stakes. Lots of investment in certain trials, where do you start? Where do you pick out the right trials and the right medicines? That's a daily concern for every C level at a pharma company, for example. Synthetic data can help you to ensure that you reduce the risk. It won't take away the entire risk, but it can reduce the risk.
On the healthcare side, there's a lot coming to us. We're just at the start of combining the electronic health record, the record data that has been stored for years now and to use it for personalized medicine, because we can really look at that as a predictor now. Because of the way we combine it with other data, multimodal data, data coming from other areas like radiological data or other types of data from the patient, you might not have access to all of that data. So maybe synthetic data can start that journey as well. So use it to experiment, to test, to fine-tune your approach, would be my advice.
ALEX MAIERSPERGER: If you're still on the very sort of earliest days of this adoption, even interest process, and you're not involved in synthetic data now, are there things that executives can be doing today that will prepare them for broader adoption in the next three to five years? Is there infrastructure they need to be investing in? Is there data prep and cleansing and organizing that they should be doing now? Is there anything that you can advise them on if they haven't dabbled in the synthetic data world yet, that would be beneficial?
MARK LAMBRECHT: From my end, it would be do what you're currently doing to prepare yourself for the whole generative AI implementation, I would say. GenAI is not replacing the need to structure your data to ensure that your semantic layers are in place, that you understand where your data is. On the contrary, you need more governance. You need more guardrails. You need more combination with deterministic and other algorithms that are already existing. It's coming on top of it. It's not just putting aside the need to bring that in place.
Same is true with synthetic data. We get a lot of questions on, OK, we have these standards here, CDISC standards, and also in Europe with the European Health Data Space, you see these nations bringing together data from different countries. There's a lot of data engineering that will be needed. Even there, synthetic data can help, not just in the fancy algorithmic stuff, but just in testing out those data pipelines and making them more robust and testing them for errors, et cetera. So that's what I would keep doing.
HARRY KEEN: Yeah, I totally agree with that. I think there's not a huge amount of difference here to preparing for a bigger AI-enabled organization in terms of the kind of infrastructure preparation, et cetera. But what I will say is synthetic data can be a part of that. So don't wait. It's very easy to test and learn now, today, with the way SAS Data Maker particularly is deployed and the support that SAS can provide.
So there's no need to hesitate in terms of bringing in some of that technical, legal, sort of commercial expertise and understanding what the limitations and capabilities of that data are. And then whilst it can be seen as a bit of a point solution to a particular use case at a particular moment in time, it could start to be seen as a more strategic technology that you can weave into the bigger projects that Mark was mentioning there as well. So getting this expertise in house and building up that sort of understanding is really critical. I would say there's no need to wait for that to happen. So yeah, you can get cracking immediately.
ALEX MAIERSPERGER: We've talked about some potential risks. We've talked about some potential use cases and some of the potential objections. And I know you've turned that around a little bit, that objections can be opportunities. And maybe there's limited knowledge, or you just haven't been able to access this technology before. And so we've talked about some of that sort of earliest adoption curve. Let's go to the far end of the adoption curve-- people that are gaining value from synthetic data today and that are saying, this is a core part of our operations, it's a core part of our research, it's a core part of our future.
Harry, fast forward 10 years from now, how do you see synthetic data reshaping the future? Is this societal-level change? And is this every organization will have simulations running on synthetic data alongside their data that they've got? Where do you see the future unfolding?
HARRY KEEN: Yeah. Well, I think the story of and the future of synthetic data is very intertwined with the story and future of AI in general. So I think particularly in health care, I think it's not difficult to imagine quite a big transformation in the way diagnostics works and patient interactions, all the way through to hands on things like surgery.
I think synthetic data will play a sort of an enabling part in all of that. I mean, to be a bit more specific about synthetic data, I really do believe there's a very near-term world where organizations will operate a sort of synthetic twin of their data, of their infrastructure. That will make experimentation and testing and bringing in some of these new technologies much, much quicker and easier with much less risk. So there's a future there that I think synthetic data really enables, in a sense.
ALEX MAIERSPERGER: Really exciting. Mark?
MARK LAMBRECHT: Yeah. Clearly, I would like to give an example concretely of pharma, and that is in drug discovery. Today, synthetic data is starting to be used to develop chemical structures that do not exist yet in reality, and to test out whether they bind, for example, to a protein or a target in the body, and to see if that can become finally a therapy to affect a disease or downstream.
So think about synthetic data as a way also to broaden the information space to test out new approaches, even though you base it on real data, that you can tweak to inform you of things, to be creative, really, in terms of what you're looking at in health care and life sciences. I think with all that we're hearing about quantum computing, about high-performance computing in general, synthetic data will move alongside the real-world techniques and analysis, such as the ability to combine different data sets, et cetera. But what we have to realize is that what we are testing out with patients is only a small fraction of what could be.
And so by evolving synthetic data alongside, between quotes again, the "real" data, you'll be ready to test out things where you didn't know what would happen. What if a patient came in from a minority population that never was in that hospital, but that has a certain disease that you didn't see yet, et cetera? We need to be ready for that as well, and more importantly, collaborate between healthcare systems, between countries. I think synthetic data will help us to be pioneering, to run ahead of where we can send the real-world data. So I think it's really opening up a massive opportunity for life sciences and health care, and to vastly look into areas where we're not looking today.
ALEX MAIERSPERGER: Well, Harry, we have your vision to thank for creating Hazy and creating this synthetic data opportunity for so many of our customers. I can't wait to experience the future that this enables for organizations to be able to find cures to rare diseases, and to be able to find better interactions between patients and physicians. It just sounds tremendously exciting, and I love that you focused on how close it is. This isn't a far-off future. This is happening in organizations and in pockets around the world today and will just continue in the future. And so, Harry and Mark, thank you so much for being here.
MARK LAMBRECHT: Thank you.
HARRY KEEN: Thank you, Alex.
ALEX MAIERSPERGER: Thank you for joining us. If you'd like to come with us on our around-the-world surf trip or challenge us to a match of badminton, please let us know in the comments or send us an email, thehealthpulsepodcast@sas.com. See you next time.