Need advice on how to best use HttpClient.GetAsync() to make many requests at the same time

Hi!

I need advice on how to best use HttpClient.GetAsync() to make many requests at the same time. That is what this is all about.

I'm writing a web crawler that checks for broken links. I have read up on async/await and understand most of what it is about. I had the crawler working without async/await (with HttpWebRequest), and back then, this is how it worked:

I had a class Crawl that corresponded to the crawl of one web site (i.e. a domain, not a page). It visited one page at a time from the web site's domain, saving the links it had visited in a HashSet, and so on. I ran about 30 threads at a time, each with its own Crawl instance, and it worked decently.

But I wanted to crawl 200 web sites at a time, and figured it should be possible. At least, the program should not be limited by the CPU, since most of the time is spent waiting for web responses. But I heard here on reddit, among other places, that HttpWebRequest does not use the CPU efficiently, and I was advised to use async/await and HttpClient.GetAsync().

However, I'm not sure how to structure the whole program so that it becomes simple and efficient. Waiting for a Task's result should block efficiently, right? As if waiting for a lock or sleeping. But I bet having a bunch of threads waiting for Task results is not the best option. I tried that, since it's an easy change, but it didn't seem to work much better than before.

I thought about using one main thread to fire off all the web requests, then having the await continuation do the required processing and somehow mark the web site as ready for another request. Does C# have something like Go's selectable channels that I could use for this?
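
Roughly what I have in mind, as an untested sketch. The names CheckLinkAsync and Gate, the cap of 200 and the example URLs are placeholders I made up, and I don't know whether this is the right structure:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class CrawlSketch
{
    // One HttpClient shared by every request.
    static readonly HttpClient Client = new HttpClient();

    // Placeholder cap on how many requests may be in flight at once.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(200);

    // Returns true if the URL answered with a success status code.
    static async Task<bool> CheckLinkAsync(string url)
    {
        // Waits asynchronously for a free slot; no thread is blocked here.
        await Gate.WaitAsync();
        try
        {
            // Only the headers are needed to decide whether a link is broken.
            using (HttpResponseMessage response =
                await Client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
            {
                return response.IsSuccessStatusCode;
            }
        }
        catch (HttpRequestException)
        {
            return false; // DNS failure, connection refused, and so on
        }
        catch (TaskCanceledException)
        {
            return false; // the request timed out
        }
        finally
        {
            Gate.Release();
        }
    }

    static async Task Main()
    {
        // Placeholder URLs; the real crawler would feed in discovered links.
        var urls = new List<string> { "http://example.com/", "http://example.org/" };

        // Start every check without blocking, then await them all together.
        Task<bool>[] checks = urls.Select(url => CheckLinkAsync(url)).ToArray();
        bool[] results = await Task.WhenAll(checks);

        for (int i = 0; i < urls.Count; i++)
        {
            Console.WriteLine($"{urls[i]}: {(results[i] ? "ok" : "broken")}");
        }
    }
}
```

The idea is that nothing ever blocks on a Task's result: every wait is an await, so a handful of thread pool threads could in principle serve all 200 web sites.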

By the way, is there a way to find out whether a thread is truly inactive (yielding, sleeping, parked or whatever) rather than busy-waiting? When I profiled the old HttpWebRequest version, the profiler said that CPU usage was at 0% most of the time (for one web site), yet running many Crawls at the same time still does not work. The first 30 or so start fine, but after that, starting additional threads slows down significantly. By contrast, threads that just wait for a lock can be started at a rate of about 200 per second. So, to me, that is reason to believe that the CPU is not being used efficiently.

I hope this gives you an idea of what I'm trying to do. I appreciate any help.

by tufflax via /r/csharp
