scrapy-redis组件中如何实现的任务的去重?

scrapy-redis组件中如何实现的任务的去重?

请先 登录 后评论

1 个回答

李奡 | 奈学教育 - 奈学教育 | 讲师
擅长:大数据
a. 内部进行配置,连接Redis
b.去重规则通过redis的集合完成,集合的Key为:
   key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
   默认配置:
      DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
c.去重规则中将url转换成唯一标示,然后在redis中检查是否已经在集合中存在
   from scrapy.utils import request
   from scrapy.http import Request
   req = Request(url='http://www.cnblogs.com/wupeiqi.html')
   result = request.request_fingerprint(req)
   print(result)  # 8ea4fd67887449313ccc12e5b6b92510cc53675c
scrapy和scrapy-redis的去重规则(源码)
1. scrapy中去重规则是如何实现?
class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.fingerprints = set()
        

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        # 将request对象转换成唯一标识。
        fp = self.request_fingerprint(request)
        # 判断在集合中是否存在,如果存在则返回True,表示已经访问过。
        if fp in self.fingerprints:
            return True
        # 之前未访问过,将url添加到访问记录中。
        self.fingerprints.add(fp)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

        
2. scrapy-redis中去重规则是如何实现?
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.

    """

    logger = logger

    def __init__(self, server, key, debug=False):
        
        # self.server = redis连接
        self.server = server
        # self.key = dupefilter:123912873234
        self.key = key
        

    @classmethod
    def from_settings(cls, settings):
        
        # 读取配置,连接redis
        server = get_redis_from_settings(settings)

        #  key = dupefilter:123912873234
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        # self.server=redis连接
        # 添加到redis集合中:1,添加工程;0,已经存在
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        
        return request_fingerprint(request)

    def close(self, reason=''):
        
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)
请先 登录 后评论