Go Concurrency in Go: Implementing a Thread-Safe Web Crawler
In the official tutorial of Go language, Web Crawler practice is a very classic concurrent programming case. It requires us to modify a basic recursive crawler so that it can crawl the URL in parallel, while ensuring that the same URL is not repeated.
This article will take you to analyze the original code problem step by step, and use Go’s sync package to achieve an efficient and safe concurrent crawler.
1. The problem of the initial code
The original Crawl function, while simple logic, has two fatal flaws in practical applications:
1. Serial execution: The code handles sublinks by calling Crawl recursively. This means that the program will not start processing the next sibling node until the previous page and all the subpages are crawled. This completely wastes the performance of modern multi-core CPUs, and does not reflect the concurrency philosophy of the Go language “don’t communicate through shared memory, but share memory through communication”.
2. Repeat crawl and an infinite loop: If webpage A is linked to B, and B is linked back to A, the original recursive logic will fall into an infinite loop. Even if there is no loop, when multiple pages point to the same resource, it will lead to duplicate network requests, waste of bandwidth and processing time.
The core solution
In order to solve the above problem, we need to introduce two key mechanisms:
1. Concurrency control: Use goroutine to initiate network requests in parallel.
2. Status synchronization: Use mutex locks (Mutex) to protect the shared “Accessed” record table to ensure concurrency security.
3. Detailed explanation of code implementation
The following is a complete code implementation after complete improvement:
package main
import (
"fmt"
"sync"
)
type Fetcher interface {
// Fetch returns the body of URL and
// a slice of URLs found on that page.
Fetch(url string) (body string, urls []string, err error)
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher) {
// visited 用于记录已经访问过的 URL,避免重复爬取
visited := make(map[string]bool)
var mu sync.Mutex // 互斥锁,保护 visited 的并发访问
var wg sync.WaitGroup // WaitGroup 用于等待所有爬取任务完成
// crawl 是内部递归函数,实际执行爬取逻辑
var crawl func(url string, depth int)
crawl = func(url string, depth int) {
defer wg.Done() // 确保每个 crawl 调用结束时调用 Done,通知 WaitGroup
if depth <= 0 {
return
}
mu.Lock() // 加锁,保护 visited 的访问
if visited[url] {
mu.Unlock() // 已访问过,解锁并返回
return
}
visited[url] = true // 标记当前 URL 已访问
mu.Unlock() // 解锁
body, urls, err := fetcher.Fetch(url)
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("found: %s %q\n", url, body)
for _, u := range urls {
wg.Add(1) // 为每个新的爬取任务增加计数
go crawl(u, depth-1) // 启动新的 goroutine 爬取子 URL,深度减 1
}
}
wg.Add(1) // 为初始 URL 增加计数
go crawl(url, depth) // 启动爬取初始 URL 的 goroutine
wg.Wait() // 等待所有爬取任务完成
}
func main() {
Crawl("https://golang.org/", 4, fetcher)
}
// fakeFetcher is Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult
type fakeResult struct {
body string
urls []string
}
func (f fakeFetcher) Fetch(url string) (string, []string, error) {
if res, ok := f[url]; ok {
return res.body, res.urls, nil
}
return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
"https://golang.org/": &fakeResult{
"The Go Programming Language",
[]string{
"https://golang.org/pkg/",
"https://golang.org/cmd/",
},
},
"https://golang.org/pkg/": &fakeResult{
"Packages",
[]string{
"https://golang.org/",
"https://golang.org/cmd/",
"https://golang.org/pkg/fmt/",
"https://golang.org/pkg/os/",
},
},
"https://golang.org/pkg/fmt/": &fakeResult{
"Package fmt",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
"https://golang.org/pkg/os/": &fakeResult{
"Package os",
[]string{
"https://golang.org/",
"https://golang.org/pkg/",
},
},
}
4. Analysis of key technical points
1. Why need sync.mutex?
In Go language, the map type is not concurrent safe. If multiple goroutines try to read and write the same map at the same time, the program will directly panic errors. Therefore, we are checking Visited[url]and settings visited[url]= true between these two operations, atomicity must be guaranteed. With mu.lock() and mu.unlock(), we ensure that only one goroutine can modify or check this map at a time.
2. The role of sync.waitgroup
When we start multiple goroutines, the main function (main) does not automatically wait for them to end. If there is no waitgroup, the main function may exit immediately after the first request is issued, resulting in forcibly terminated the subsequent sub-tasks.
wg.add(1): Add 1 to the counter before starting a new goroutine.
defer wg.done(): Decrement the counter by 1 at the end of goroutine execution.
wg.wait(): Block the main thread until the counter is zero, that is, all tasks are completed.
3. The combination of closure and recursion
We define a local function, Crawl, inside Crawl. The advantage of this is that it has free access to external visited, mu and wg variables without passing these parameters layer by layer. At the same time, with Go Crawl(u, depth-1), we convert recursively to concurrent task distribution.
5. Summary
Through this exercise, we not only implemented a simple crawler, but also more deeply understood several core concepts in the Go concurrent model:
– Improve the efficiency of I/O-intensive tasks with Goroutine.
– Protect shared states with Mutex and avoid data races.
– Coordinate the life cycle of multiple coroutines using WaitGroup.
In the actual production environment, you may also need to consider adding timeout control, error retry mechanism, and limit the maximum number of concurs (using the buffer with buffer as the semaphore), but this basic version already covers the core logic of concurrent programming.