Responsible Web Scraping: Challenges and Approaches

Episode #001

43 min listen

“Web scraping is neither legal nor illegal. It's how you use it and what you scrape”
Ondra Urban, LinkedIn
Description

The topic of this episode of Ethical Data, Explained is ethical data and web scraping. Henry Ng is joined by Ondra Urban to discuss Ondra's time at Apify, making the web more programmable and accessible, maintaining ethical standards as a web scraping company, and the implications of the HiQ Vs LinkedIn case.

Learn more from Ondra Urban on maintaining ethical data standards as a web scraping company.

Transcript

Henry Ng - 00:00:00:

Welcome to Ethical Data Explained. Join us as we discuss data-related obstacles and opportunities with entrepreneurs, cybersecurity specialists, lawmakers, and even hackers to get a better understanding of how to handle data ethically and legally. Here to keep you informed in this data-saturated world is your host, Henry Ng. Hi, everyone. Welcome to Ethical Data Explained with SOAX. I'm your host today, Henry Ng, the VP of Sales and Partnerships, and I'm joined today by an individual who is from Apify. He started his career in the legal field, moved into software development, and now sits as the COO at Apify. His name is Ondra Urban, and we want to welcome him today. Hi there, Ondra. How are you today?

Ondra Urban - 00:00:45: 

Oh, I'm great, Henry. Thanks for having me. This is super exciting. I've never been on a podcast before, so I'm really enjoying it already. 

Henry Ng - 00:00:52: 

Good, I'm really glad. I'm really glad. We're really happy to have you here today. And I think the first thing we want to start with is really to know a little bit more about you, your career, and your experience. So I'll pass over to you and you can give us a little bit of history on Ondra, basically. 

Ondra Urban - 00:01:06:

All right, yeah. Well, I started as a lawyer a long time ago, but I always found the job a bit more complicated than I expected. It's a lot of conflict, it's a lot of unpleasant situations, and you don't really see yourself as a person going forward towards a solution, more as a person that's trying to break other people's solutions over insignificant problems, I would say. And I have huge respect for all my friends from law school who still do the job, and in the end, it is pretty satisfying to win a court case or to be able to advise on huge billion-dollar deals and so on. So it's not like it's a bad job. I just didn't like it. For me, personally, it wasn't a fit, and I wasn't very happy about it. I was like, what am I going to do? I wasn't sure about that. I didn't know anything else, to be honest. So I was like, okay, so maybe I could be a project manager. You don't need a specific education for this. Like, if you're smart and you study a bit, you'll figure it out. So I applied to different project management jobs in various startups and tech companies because I always liked computers and technology and stuff like that. And I always got to the second stage, third stage of the interviews, and then they would tell me, like, dude, you're great, you're great, but you just don't have the skills.

Henry Ng - 00:02:35: 

That's not great to hear all the time, but I'm guessing it drove you to kind of aim for something else and aim a bit higher as well.

Ondra Urban - 00:02:42: 

Exactly. I figured this out after a few interviews, that I actually don't have the skills. Right. And I applied to college, working full-time and studying part-time, to become a project manager, like a tech project manager or IT project manager. But because of that, I had to learn the basics of programming. And I had never programmed before, or maybe once or twice in my life in school or something. I never really paid much attention to it, but because of that, I had to learn it. And it took only maybe two weeks for me to absolutely fall in love with it. It was crazy. I was literally working full time and then studying full time, learning programming from Udemy and all kinds of online courses, Coursera. I was living on YouTube and in those places, just trying to figure it out and piece the knowledge of software engineers together online. And actually nowadays you have amazing resources for this. So I guess anybody can do it who has the dedication, and I had the dedication not so much because I really wanted to become a software engineer; it was more that I really didn't want to be a lawyer anymore.

Henry Ng - 00:03:50:

So you had the drive to do something else.

Ondra Urban - 00:03:53: 

Yeah, exactly. And it worked out for me really well because after maybe a year of studying, I was able to land a job. It was a pretty good job. I learned a lot. We were working in a really small team on a server, like a backend framework, for a big company that does business with banks and energy companies and really big things like that. So I got the opportunity to tap into their knowledge about really complex systems. And I was working with this super kind and amazing software architect who basically taught me everything about software development and backend development. After a year, this project ended because we kind of built it and that was it. And they were like, okay, so you can be here. You can kind of maintain it and work with the system. I was like, it's probably not challenging enough for me. So I started looking elsewhere, and after a while I was supposed to start working for Kiwi. I don't know if you know Kiwi.com, this company, but at the last minute the deal was off. The reasons don't matter, but they were legit. So I was like, okay, but I'm already one leg out of the door. I don't want to stay where I was. So let's search for something else. I opened a jobs portal and the first company there was Apify. I was like, what is this? Oh, web scraping. That's cool. I thought, this is illegal, I want to apply.

Henry Ng - 00:05:14: 

Yeah, it's something different, isn't it? I feel like especially back in 2018, it was the start of the big movement to web scraping and data collection. So it did seem a little bit illegal back then. 

Ondra Urban - 00:05:27: 

Yeah, exactly. It really piqued my interest. And I love Apify. I'm super glad that they hired me. I was surprised in a way, because the team at that point was really small, so they were looking for senior developers. But I guess that my legal background convinced them that I should join. And it's been a ride ever since. I started as a developer. Then I got the responsibility to take care of our SDK, which you may now know as Crawlee. It's a 7,000-star web scraping library for JavaScript. And as I was going, I got my own team. Then I was leading teams of teams and so on, and now I'm the COO. And it's been a crazy ride in a way. And I wouldn't trade Apify for anything. It's the best company I've worked for, really.

Henry Ng - 00:06:17:

It's great to hear. I mean, it sounds like even though you didn't go into project management, now that you're COO, you're kind of the manager of all the project managers going on in the company now. So I feel like it's gone full circle for you. 

Ondra Urban - 00:06:29: 

Yeah, in a way it did.

Henry Ng - 00:06:31:

Exactly. It's an amazing journey, especially going from, like you said, taking apart and kind of dissecting these problems to really kind of managing and solving them now, especially in the COO role and through your web development roles. And obviously we had some background conversations with you just to find out a little bit more about yourself. And one of the main things that really stuck with us from our pre-podcast conversation with you was that you said you enjoy finding weak spots and bottlenecks in systems and unlocking their full potential. So what draws you in about those bottlenecks? And tell me about those encounters that you might have on a regular basis.

Ondra Urban - 00:07:16:

Yeah, I'd say with every system there's something wrong, right? You just maybe don't know about it. But maybe I'll start from a different note, or a different end of the problem. When you look at software engineers, or when you talk to software engineers, most of them are builders. They want to build things. They love when they can ship stuff to production, when something they create is used by other people and so on. I'm kind of weird or special in a way, in that I don't like this that much. For me, building stuff is not that important. I don't know why, but I always liked debugging more, like finding issues, finding problems and then solving them. Maybe it's because I like puzzles, I don't know. Maybe it's because building gets a bit repetitive, because once you know how to build it, you then have to build it again. Like, for example, when you're working for a company that builds mobile apps and you just do one mobile app after another, then it might get a bit repetitive. The truth is I don't know, because I've never done that. But it's how I feel about it. At least it's how I try to imagine why I trend towards debugging more than building, or what my motivations could be. But in reality, I don't know. I just know that it works for me. I love it. I love when there's a problem and you can really dive deep into the problem and look at it from all different angles, figure out what's going on there, and then slowly hack away at it, and suddenly you solve it and you have this adrenaline or dopamine rush, like, yes, finally. I guess maybe with building, it's a bit more incremental, just a step-by-step thing, whereas with fixing bugs, it's like you get all the dopamine at the end, of course.

Henry Ng - 00:09:01: 

What are some of the weak spots that you normally find in systems, and where do these bugs come from when we're talking about systems like Apify?

Ondra Urban - 00:09:10:

I would say this really changed as my career progressed. So when I was the sole developer building the Apify SDK, or Crawlee now, it was mostly issues that I created myself. I would build and build and build, and then suddenly I would spend two weeks trying to figure out what I built wrong. So that's one kind, like engineering bottlenecks. I really love to dive deep into those and work with a debugger, trying to trace everything and inspect everything and see what's actually going on in the system. So that's one thing. But over time, obviously, those bottlenecks or weak spots have changed. Once I moved into management of managers, meaning I had teams which actually had their own managers and I was managing them, suddenly it was more about communication and making sure that the teams work well together, that they have everything they need, that the people are happy and that they feel they are making a dent in the company strategy, or that they progress and that they feel they have a good future in front of them. And now as a COO, it's a bit different because now I don't have many direct reports. I'm not a manager of people, I'm more a manager of situations or projects or problems. And I have other people that do the work and they're excellent. They are very good at their jobs. But sometimes it's hard for marketing to grasp the full scope of the engineering problem, or for engineers to grasp how important it is to communicate this feature to customers and communicate it in a way that the customers will understand, because they are not living 8 hours a day in this code base. And obviously they need a bit more of a nudge. Nowadays, the biggest bottlenecks I see are really communication issues within teams, and multi-team projects that need some higher-level owner who kind of makes the hard decisions at the end or who just clears the path for projects to happen.
Because it can be very difficult for some things that require the teamwork of three to five teams to actually happen because obviously, all of them are busy, right? And if one team finishes their work and the other team is busy, they put it at the end of their backlog three months later, and then the next team puts it at the end of their backlog three months later. So I'm kind of running around and being like, okay, hey, guys, could you please push this?

Henry Ng - 00:11:55:

It kind of sounds like you're the glue at Apify, where you're removing those blockers and getting teams to communicate. And that's obviously an enjoyable thing, especially when you were previously so focused on fixing those bugs and fixing those issues at a program level, and now you're doing it at a company level. One of the main questions I want to ask you is what attracted you to Apify in the first place, beyond just that it was the first job that came up on the job portal for web scraping? Was there anything else about the company that really drew you to it?

Ondra Urban - 00:12:24:

Yeah, I think the fact that I found them as the first company was just pure luck. But anyway, I think even if they were fifth or sixth, I would have ended up here, because the founders just had a really great vibe when we were interviewing. And I kind of like working with people who think in a similar way as I do. So it's just like, let's not stress too much about details. Let's just focus on shipping, focus on doing our best jobs. And when I started, we were working out of one room in, like, a coworking center. We could not have a coffee machine because it was too noisy. So we were making filtered coffee because the kettle is not that noisy. And it was just great. And I think we've kind of kept this culture or atmosphere in the company ever since. Like, on Friday, we had a 7th birthday party. I've actually been at Apify only for four years or four and a half years. But this was the 7th birthday party, and it was just great to see all the people who maybe are no longer with the company, but they would show up anyway and have a good time and feel like, hey, when I was here, I really loved this. It's very special. Or when we have people leaving the company and they say, you know, the next company is good. I love the challenges, I love the work, everything's cool. But the culture here is something else. And even if I comment on it from the position of a COO, we have interviews with employees and we do some surveys, and it just seems that it is the culture that the people here love and why they don't want to leave and why they stick.

Henry Ng - 00:14:12:

I mean, it sounds like Apify is really focused on accessibility at a company level, and I feel like that rings true in your overall company mission, the idea that you want to make the web more programmable, which makes it more accessible. How do you aim to make it more programmable and accessible without unleashing utter chaos on everyone who wants access to it, and especially on everyone who is using the Internet and the web as well?

Ondra Urban - 00:14:37: 

It's an interesting way of putting it. I think unleashing chaos is important for any change and for progress. So we love unleashing chaos. But have you ever played Baldur's Gate or Wizards of the Coast kind of things? Yes. So we're like chaotic good. We sometimes make mistakes, but we try to fix them very fast. But in the grand scheme of things, we want to be good and we want to be helpful and we want to be ethical and we want to do good in the world. And our approach to web scraping is really that there's just this vast amount of data, this huge database of information called the Internet. And it just makes sense for people to have access to it, right? And to have programmatic access to it so that they can process the vast amounts of data, not only look at it, but actually get some insights from it and do something good for the world. So by unleashing a little bit of chaos, we try to make the world a better place in the end.

Henry Ng - 00:15:39:

So kind of controlled chaos, so you can challenge the social norms and social stigmas that are normally in place. I think that's a great way to do it. And you don't really see that in tech. You see that in the world overall. And it's good to see companies like Apify taking that forward. And one of the programmable aspects, obviously, that you as a company go for is scaling processes, robotizing tedious tasks, and kind of speeding up that workflow with flexibility and automation. In what ways have these processes been applied, and how do you manage to maintain an ethical standard, really? Because in data collection, ethics is a big thing. Especially since this podcast is all about ethical data, it would be great to know a little bit more about how you maintain ethical standards when you're doing the work that you do at Apify.

Ondra Urban - 00:16:25:

Yeah, I guess there are two prongs to this because Apify is a platform, right? So anybody can come and do whatever they want here. Basically it's like AWS or Google Cloud. They don't monitor what software you are running on the servers. They are your servers. You purchase them and you can do whatever you want with them. But if somebody complains, then you have trouble, right? So this is how we approach our business as well on the platform side, and there's really no other way of doing it. In legal terms, it's called safe harbor, and it really is the way that any kind of platform can work. It basically says you're not responsible for the actions of your users unless you have known about them previously. So we're relying on safe harbor to actually be able to do this business. But then whenever someone reaches out to us and tells us that there's something wrong, we try to be extremely helpful and supportive and try to figure this out. Because we understand that you cannot build a good name for web scraping or data extraction by doing it the shady way. Right. We want to convince everybody, and we want to be helpful and understanding. So if somebody comes at us and says, you know, you shouldn't be doing this, we try to calmly explain, okay, those are the boundaries. We can be doing this. If we find that something is wrong, like, I don't know, somebody is using our platform to abuse a certain website, we can just disable that user. It doesn't happen often. Luckily, we don't have that many of those users. But it has happened once or twice, and we have done that. And then there is the business that we do on our own, where we have customers that we scrape websites for, where we build custom solutions, typically the large-scale ones for big companies. And there we do really stringent legal analysis and project analysis. We look at how many requests we can actually send to the website so that we don't overload it. And there are many different tools that you can use.
Obviously, you have no way of knowing how powerful their servers are, but there are places like Similarweb or other resources on the Internet that you can use to kind of gauge this. And we're always very careful around personal data, around copyrighted data. We have a lot of legal opinions around that. But in the end, because I could continue forever about this, in the end it really is about having this sense of honesty, or this idea of doing good in the end. Or maybe, I think Google had this as some part of their mission: do no harm. Exactly. Yeah. So really, do no harm. Like, use your brain and think if what you're doing is a win-win situation, or if you're just unloading so many requests on a website that it's just slower for all the other customers. Or if you download all the pictures from some website and then republish them on a different website, that's not cool. That's piracy. So try to think, am I actually making the world better by doing this? And as for our projects, we have this project for an American nonprofit called Thorn, and for them, we're scraping data, or images and videos, from adult websites and escort websites. And it's to fight child trafficking, because Thorn have software that's used by law enforcement in the US. And basically they can use images or videos of kids, typically teenage girls, and upload them into the system. And thanks to the system having a vast database of images and videos that we supplied from those escort and adult sites, they can actually find the kids. And over the four years we've been working with them, they were able to identify 17,000 of them, and now it's maybe more because they have not updated their website in the past few months. It could be like, what, 20K now? It's crazy, and it's just scraping for good. It's the power of web scraping, where you can use public data for good things, and we're extremely proud of use cases like that.
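
Editor's sketch: the request budgeting Ondra describes (estimate a site's traffic from an outside source, then keep the scraper to a small slice of it) could look roughly like the Python below. This is an illustration only, not Apify's actual method. The function and class names, the 30-day month, the pages-per-visit figure, and the 1% traffic share are all assumptions made for the example, and the monthly visit estimate would have to come from a traffic-estimation service.

```python
import time

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def max_requests_per_second(estimated_monthly_visits, pages_per_visit=3.0,
                            traffic_share=0.01):
    """Budget the scraper to a small share of the site's estimated page loads.

    All three inputs are rough, hypothetical figures: the visit estimate comes
    from an external traffic-estimation source, and the share is a policy choice.
    """
    page_loads_per_second = estimated_monthly_visits * pages_per_visit / SECONDS_PER_MONTH
    return page_loads_per_second * traffic_share


class PoliteThrottle:
    """Spaces requests so they never exceed the given rate."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        if requests_per_second <= 0:
            raise ValueError("requests_per_second must be positive")
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been sent yet

    def wait(self):
        """Block until the next request is allowed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.min_interval
```

In use, a scraper would call `throttle.wait()` before each request. An average rate hides daily peaks, so a real project would likely lower the share further for small sites and back off whenever responses slow down.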

Henry Ng - 00:20:53: 

I think that's a great word. But the idea is that scraping data is such a gray area, even the use of proxies is such a gray area in the world market. And not only are you maintaining ethical standards from your kind of KYC side, knowing your customers and stopping anything that is malicious from that side, but you're also actively working to put ethical standards into the world as well through your work with Thorn. And yeah, I mean, now we're on the topic of some of the clients you're working with. I know that recently you've done some research with Boston College and you received a grant to look into property tax in the Massachusetts area. With that type of research, how did your work with them go? And then what did you achieve? And were there at any point conflicts between what you were trying to do and federal property laws? Basically, overall, how did that look for you as a company, to work with such a prestigious college?

Ondra Urban - 00:21:41:

Yeah, it was actually a researcher from the college that reached out to us because, as I said, we're a platform, so unless somebody reaches out, we typically don't know what they're doing. And we're always super happy when somebody comes back and it's like, oh, I have this interesting project, could you help me with this? And we have people who can help, so it's always a good idea to reach out. Customer success is extremely important for us, even though it might sound cliche or whatever, especially with platforms like ours where it takes a lot of effort to onboard. It's not like Facebook where you log in and click three times and you're there, right? You have to actually build something and you have to understand what's going on. So we really try to work with our customers and help them onboard. So this was the researcher reaching out to us and asking, hey, I'm trying to do this thing, could you help me? And we were really happy to help because it was an interesting use case, to be honest. You were asking about federal laws. Luckily there aren't any federal laws banning web scraping. That's a good thing. Obviously that would be a bit complicated for us, but this was all public data and there were no intentions on the researcher's side to go anywhere beyond that. Plus, US law is very lenient for researchers in terms of using the fair use doctrine and other legal... I don't want to say loopholes, but I'm not a native English speaker so I have a bit of trouble finding the correct word.

Henry Ng - 00:23:05: 

No, we'll go with loopholes. It's kind of that gap in the law. 

Ondra Urban - 00:23:09: 

So research usually has a lot of exemptions for being able to do what maybe companies would not be so easily able to do. So from an ethical or legal standpoint, this was a very easy project. There were no issues.

Henry Ng - 00:23:24:

Amazing. And just kind of moving on from that. I know that you've got a legal background in working on cases, so in the case of, say, HiQ versus LinkedIn, what we understand is that it seems to be a major victory for LinkedIn themselves. What do you think the implications of this case are for, like, future web scraping and web scraping tools?

Ondra Urban - 00:23:45:

Yeah, it's a landmark case, really, like a super important one. And the important thing is that there was a new decision a month ago, maybe three weeks ago, but this only decided certain parts of the case. People who follow web scraping might have noticed that over the past four years there have been numerous articles and blog posts and news about web scraping finally being legal. Right, so that's not entirely the case. Web scraping is neither legal nor illegal. It's how you use it and what you scrape, and it's very complex. There are certain things that you are most likely to be able to scrape, and then there are other things where you would need very specific conditions to be able to scrape them. So if you're scraping for a living, or if you're scraping for a really big project, then it's always good to consult a legal representative, legal counsel. But back to HiQ. So originally there was a preliminary injunction issued, and it was basically saying that it's okay to scrape publicly available data. Obviously LinkedIn appealed this. Then it went to the Supreme Court to decide other things and then back to, I think it was the Ninth Circuit, who confirmed the earlier decision and kind of scoped it: scraping public data is unlikely to breach the CFAA, which is the Computer Fraud and Abuse Act. So it's not criminally punishable if you scrape public data, which is translated by many people to "web scraping is now legal." But there are other things that you need to be careful about when you're scraping: terms of use, for example. And that's actually what was the subject of discussion in the second, or the latest, HiQ and LinkedIn decision, because of what the court basically said, and this was not surprising at all. I would not frame it as a major victory for LinkedIn. Obviously, their PR representatives had a very easy job doing that. But the reality is the case will still go to trial, most likely.
So this was only a summary judgment discussing certain things, but what the summary judgment said is that, yes, there was an agreement between HiQ and LinkedIn based on terms of use, and that those terms of use are enforceable. That's super important. And also that damages can be claimed if LinkedIn can prove that some damages occurred, which could be very difficult for LinkedIn. But I would say I was a bit more of a dreamer in a way, and I was hoping that the court would go a bit further and they would say, look, this is public data, so you cannot ban scraping of public data by terms of use. And they would say, this clause in the contract is not enforceable because it's like free speech or something. I'm not a US lawyer, so don't quote me on this, but I would expect they would pull a trick like this out of their hat, because this is the 21st century, right, and say, okay, the data is public. You have made it public. LinkedIn, you have decided to make it public. You wanted it public, so now you can't complain that somebody downloaded this public data. It wasn't like that. So the dreamer in me was a bit sad, but the reality is there was a contract in place and HiQ did not honor it. So I understand why it looked like this, why the decision was made this way. We will still be following the case. But if I should just sum this up, there are two important things. Scraping publicly available data is most likely not criminally punishable. But if you have any sort of contract with the company that you're scraping, you have to be really careful about what's in the contract and how they could possibly enforce it. Plus, I forgot one thing, and that's kind of interesting as well.
LinkedIn was complaining that HiQ used fake accounts and contractors, and the court decided as well that the use of fake accounts constitutes breach of contract anyway, even though you're just creating the fake accounts for the purpose of accessing data that may or may not be public. That was in this specific case, but the court made quite an important decision on that, and every scraping company, or everybody who's scraping, should be wary if they're using fake accounts and probably should avoid them if at all possible. We're trying to avoid them at all costs at this moment. Not that we weren't before, but it's just something that you have to take into account when you're running a company.

Henry Ng - 00:29:02:

So it sounds like it's more a starting precedent than a solid landmark that's going to dictate how everything in the scraping space works moving forward, and I completely agree that it's a starting point that we're looking at. And with more innovation and more trends changing in the scraping space, I feel like it's a good place to build from. Just on that topic, since you're obviously monitoring the scraping space and all the innovation going on, what kind of major trends do you see, and what type of innovations do you see coming out of not only that case, but out of the market in general, moving forward over the next, say, six months to a year?

Ondra Urban - 00:29:38:

Yeah, I don't think there are going to be any major improvements in the area in six months or a year. I think AI is going to play a big role, but I think it's still a bit early for it to completely replace programmers. AI is really good when you need huge amounts of data and you don't care that much about the quality of the data. From our experience, AI can take you like 80%, 85% there, and for some use cases, it's totally fine and it saves you a lot of money building custom scrapers. Right. But for other use cases, it's not fine because you need precise data and AI is just not there yet. Maybe for some websites it is, but for some websites it isn't. Therefore you can't use just AI, right? Because you still have to have people who at least check the data, and then if it doesn't work, what do you do? You build a custom scraper, maybe, I don't know. So we're not employing AI too much at this moment, although we're trying to and playing with it. We see some of our competitors putting a few more chips on this technology. We're waiting it out a bit, like Apple waits until the technology matures and then releases it.

Henry Ng - 00:30:55:

Everyone else has already released it, but yeah, exactly. 

Ondra Urban - 00:31:00: 

But it is the technology of the 21st century and I think it's going to make a major impact in web scraping as well. What we're trying to do instead is focus on developer experience, to make it as easy, as convenient, and as fast as possible for developers to build scrapers, because we think that, at least for the foreseeable future, humankind will need developers. And developers are a bit picky about what they use and they are very opinionated. And so we want to give them the best tools possible and we want them to love them and we want them to share their excitement with other developers. So that's our bet.

Henry Ng - 00:31:40:

What better person to do it than the COO who's got a software development background?

Ondra Urban - 00:31:44:

Yeah, thank you. Hopefully I'll manage to make it happen. But, yeah, AI is something that I see a big future for, but it might take more than six months or a year. What I would actually love from AI is for it to fix broken scrapers more than actually build them. Now, what you get is you have those AI extractors: you give them HTML and they parse the data. If it's a product site or an article, it can find the data and kind of give it to you without you having to write the selectors. That's good. But the quality is good for some things, not so good for other things. What I would love to see is that the AI could monitor scrapers, and it's something we played with, but it's quite a complex problem. Basically, a developer would build a scraper, right, a custom one. And then, instead of the AI trying to figure it out on some generic model, you would train it based on the data from this scraper that a human built. And then using this data, the scraper could kind of automatically fix itself, because whenever the selectors changed, it would have this historical trend of data to look for. And it could compare and maybe find the new selectors with a much higher success rate than if you're just using a generic pretrained model that needs to do it for every single website in the world. And this is something that I don't know if anybody's building. If they are and they're listening to this podcast, please let me know. I would love to try it. But it feels very exciting because the maintenance of scrapers is something that takes a lot of developers' time and energy. And even if it worked in only 50 or 70% of cases, it would still be amazing, because now you just have to fix all of them, right? So if you only had to fix 50% of them, that would already be a huge upgrade.
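
Editor's sketch: the self-repair idea Ondra describes (use the values a human-built scraper extracted in the past to find a replacement when its selector breaks) can be illustrated with a much simpler stand-in than a trained model. The Python below is that stand-in, a plain string-similarity heuristic and not an AI system or anything Apify has built. Every name in it is hypothetical, and it assumes some other step has already collected candidate selectors and the text each one yields on the changed page.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def shape(value: str) -> str:
    """Reduce a value to a coarse pattern: digits become 9, letters become a."""
    return "".join("9" if c.isdigit() else "a" if c.isalpha() else c for c in value)


def similarity_to_history(candidate: str, history: List[str]) -> float:
    """Best shape similarity between a candidate value and any past value."""
    if not history:
        return 0.0
    return max(
        SequenceMatcher(None, shape(candidate), shape(old)).ratio()
        for old in history
    )


def propose_selector(candidates: Dict[str, str], history: List[str],
                     threshold: float = 0.8) -> Optional[str]:
    """Pick the candidate selector whose text looks most like past extractions.

    `candidates` maps a selector string to the text it currently yields.
    Returns None when nothing is similar enough, so a human can step in.
    """
    best_selector, best_score = None, 0.0
    for selector, text in candidates.items():
        score = similarity_to_history(text, history)
        if score > best_score:
            best_selector, best_score = selector, score
    return best_selector if best_score >= threshold else None
```

For example, with a price history of `["$19.99", "$24.50"]`, a candidate yielding `"$21.75"` matches the historical shape exactly, while a product title does not. Returning `None` below the threshold keeps a developer in the loop for the cases the heuristic cannot settle, which fits the partial success rate mentioned above.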

Henry Ng - 00:33:45:

Exactly. If anyone is listening to the podcast, it sounds like Ondra is extending a partnership out there to the wider market if anyone wants to work with AI and scrapers. But that's great information. It's great to see the insight and direction that you think the scraping space is taking, and it's great to get knowledge from you in terms of web scraping overall as well. We've got a couple of other questions, but these aren't so much focused on the industry and the market, they're really focused on you, just so we know you a little bit better. And the first question is, if you could take anyone in the world of data out for lunch, who would you take?

Ondra Urban - 00:34:18: 

This would be super hard for me to answer, but thankfully this is a professionally prepared podcast, so I got the questions beforehand, right. And I was really thinking about this and I would take Edward Chen, the judge of the HiQ v LinkedIn case, because I really would want to hear his thoughts about the whole scraping of public data and monopolies on data by big platforms and how they should not be able to choose who they allow to access public data. Like, if they want to say that this data is public, then they should not be able to revoke access for some specific person just because they want to. Right. When I first read it, the decision really piqued my interest, the language used and how he talked about it. And it would be super interesting to have a chat with him.

Henry Ng - 00:35:14:

If, in some random stars-aligning moment, he's listening to this podcast: Ondra would like to take you out for lunch. Hopefully that might be something that can happen. Beyond that, what piece of software do you use on a daily basis that you think you couldn't live without? It could be an application. It could be software that you use for work. What one thing could you not live without?

Ondra Urban - 00:35:35:

I couldn't live without my Apple devices, I guess, like iOS and macOS and stuff. But if I need to pick one, it's GitHub. It's just so great. I don't know how they do it, but we copy from them, most of us in the world do. They just have great designers and a great approach and they're just really, really good at what they do. I was afraid that the Microsoft acquisition was just going to kill GitHub and it was going to be terrible. But for some weird reason, Microsoft actually improved it even more. So I'm super excited about GitHub and it's just an amazing piece of software to work with. It's our inspiration, really.

Henry Ng - 00:36:21: 

It definitely did feel like Microsoft managed to supercharge GitHub in a way that I never thought they would and really driven them to new heights. And I think that's always going to be a name for tech companies and I'm glad that you feel the same way in that. The final question we have is when have you used data to solve a real world problem? It can be the biggest one that you've ever had, or the smallest one that might have helped you decide what you wanted for breakfast. What type of data have you used to solve a real world problem?

Ondra Urban - 00:36:47:

I'm going to look at this question from a bit of a different angle because I'm using data every day to solve real world problems, even when I was just an engineer. Now, as a COO, I have a lot of Excel sheets, or Google Sheets actually, and we have BI analytics or whatever, and we're just looking at data all the time. But I think the one that makes sense talking about in the world of scraping is that once my girlfriend wanted to purchase a dress, I think it was, and it was out of stock, but she really, really wanted it. It might have been my first month at Apify, actually. So I wrote a scraper that basically went to the website every ten minutes, kind of checked if the size was already there, and once it was there, it would send her an email with a link and everything so she could just purchase it. And it actually worked. Like, we were driving somewhere, I think. I don't know, maybe to Germany. I guess we were maybe driving to Austria for snowboarding in the Alps or something like that. And the email arrives, and she just pulled up the laptop. It's like, oh, my God. Bought it. And she was super happy. And I was like, wow, it's such a tiny thing web scraping can help you with. So that was cool.
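
Editor's sketch: a restock watcher like the one described (check a page every ten minutes, send a message once the size is available) might look roughly like this. The original scraper's code is not part of the conversation, so everything here is an assumption: the function names, the `data-size` and `data-in-stock` attributes of the imagined product page, and the choice to pass in `fetch` and `notify` as plain callables so the loop can be tried without a network or a mail server.

```python
import time
from html.parser import HTMLParser


class SizeAvailabilityParser(HTMLParser):
    """Collects sizes marked as available in a HYPOTHETICAL page layout:
    <option data-size="M" data-in-stock="true">M</option>
    """

    def __init__(self):
        super().__init__()
        self.in_stock_sizes = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("data-in-stock") == "true" and "data-size" in attributes:
            self.in_stock_sizes.add(attributes["data-size"])


def is_size_in_stock(html, size):
    """Return True if the page lists the given size as in stock."""
    parser = SizeAvailabilityParser()
    parser.feed(html)
    return size in parser.in_stock_sizes


def watch_for_restock(url, size, fetch, notify,
                      interval_seconds=600, max_checks=None, sleep=time.sleep):
    """Poll the page until the size is available, then notify once.

    fetch(url) must return the page HTML as a string; notify(message)
    delivers the alert. Returns True if a restock was seen, False if
    max_checks was reached first.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if is_size_in_stock(fetch(url), size):
            notify(f"Size {size} is back in stock: {url}")
            return True
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)  # ten minutes by default
    return False
```

In practice `fetch` could wrap `urllib.request.urlopen` and `notify` could send an email through `smtplib`; both are left out here to keep the sketch small. A polite watcher would also respect the site's terms and keep the polling interval generous.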

Henry Ng - 00:38:07:

I'm really hoping you got some good brownie points for doing that. And the second thing is, I definitely need that scraper for my fiancée because she'll be in the same mindset. She's always waiting for things to come back into stock in stores. But yeah, that's a really useful tool and definitely something you could market to a lot of people, I feel.

Ondra Urban - 00:38:25:

Yeah, I was thinking about it. It was really customized for this one website, so maybe the AI people can come in and actually make it work for all shops in the world without having to maintain it. But it's true that we already have a thing in the Apify Store, which is like an app store with various scrapers, and it's called Content Checker, and you can basically select a field on the website and it will automatically send you a notification when this changes. So you can monitor, like, price changes, or maybe you could even monitor if some size is back in stock or stuff like that. So maybe check it out, maybe it will actually work for you. And this one works on all kinds of websites, you don't have to modify it for a specific website.

Henry Ng - 00:39:12:

Brilliant. Well, I'll definitely take that on board. Unfortunately, that is all the time we have for today on this podcast. I want to thank Ondra for joining us. It's been an absolute pleasure to get to know you, get to know Apify and kind of your views on the data world. And yeah, we'll hopefully have you on sometime in future to pick your brains in more detail as well. Thank you very much.

Ondra Urban - 00:39:31:

Yeah, thank you for inviting me. It's been a pleasure. I really loved it. It's an interesting experience to be on a podcast and I'm looking forward to hearing myself.

Henry Ng - 00:39:41:

We'll make sure we give you the best edit possible, but thank you very much for your time today.

Ondra Urban - 00:39:45:

Thank you. 

Henry Ng - 00:39:47:

Ethical Data Explained is brought to you by SOAX, a reputable provider of premium residential and mobile proxies, a gateway to data worldwide at scale. Make sure to search for Ethical Data Explained in Apple Podcasts, Spotify and Google Podcasts or anywhere else podcasts are found and hit subscribe so you never miss an episode. On behalf of the team here at SOAX, thanks for listening.


Ondra Urban

Ondra's been a lawyer and an engineer. Now he is the COO of Apify, but prefers to think of himself as the chief debugging officer who enjoys finding weak spots and bottlenecks in systems and unlocking their full potential. Apify’s mission is to make the web more programmable; his mission is to make Apify the well-oiled machine that can achieve that goal.
