Repository: lenarsaitov/cianparser Branch: main Commit: 236352a200b0 Files: 22 Total size: 103.0 KB Directory structure: gitextract_5z1o6h1o/ ├── .github/ │ └── FUNDING.yml ├── .gitignore ├── LICENSE ├── README.md ├── cianparser/ │ ├── __init__.py │ ├── base_list.py │ ├── cianparser.py │ ├── constants.py │ ├── definers/ │ │ ├── __init__.py │ │ ├── definer_cities_id.py │ │ └── definer_metro_id.py │ ├── flat/ │ │ ├── list.py │ │ └── page.py │ ├── helpers.py │ ├── newobject/ │ │ ├── list.py │ │ └── page.py │ ├── proxy_pool.py │ ├── suburban/ │ │ ├── list.py │ │ └── page.py │ └── url_builder.py ├── setup.cfg └── setup.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/FUNDING.yml ================================================ # These are supported funding model platforms github: [lenarsaitov] ko_fi: lenarsaitov ================================================ FILE: .gitignore ================================================ /venv/ /build/ /dist/ /cianparser.egg-info/ __pycache__/ ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2023 Lenar Saitov Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
### Collecting data from Cian, a real-estate rental and sale listings website

Cianparser is a Python 3 library (Python 3.8 and above) for parsing the [Cian](http://cian.ru) website. It collects fairly detailed, structured data on short-term and long-term rentals and on sales of flats, houses, townhouses, and so on.

### Installation

```bash
pip install cianparser
```

### Usage

```python
import cianparser

moscow_parser = cianparser.CianParser(location="Москва")
data = moscow_parser.get_flats(deal_type="sale", rooms=(1, 2), with_saving_csv=True, additional_settings={"start_page":1, "end_page":2})
print(data[0])
```

```
                              Preparing to collect information from pages..
The absolute path to the file:
 /Users/macbook/some_project/cianparser/cian_flat_sale_1_2_moskva_12_Jan_2024_21_48_43_100892.csv

The page from which the collection of information begins:
 https://cian.ru/cat.php?engine_version=2&p=1&with_neighbors=0&region=1&deal_type=sale&offer_type=flat&room1=1&room2=1

Collecting information from pages with list of offers
 1 | 1 page with list: [=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>] 100% | Count of all parsed: 28. Progress ratio: 50 %. Average price: 45 547 801 rub
 2 | 2 page with list: [=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>=>] 100% | Count of all parsed: 56. Progress ratio: 100 %. Average price: 54 040 102 rub

The collection of information from the pages with list of offers is completed
Total number of parsed offers: 56.

{
  "author": "MR Group",
  "author_type": "developer",
  "url": "https://www.cian.ru/sale/flat/292125772/",
  "location": "Москва",
  "deal_type": "sale",
  "accommodation_type": "flat",
  "floor": 20,
  "floors_count": 37,
  "rooms_count": 1,
  "total_meters": 39.6,
  "price": 28623910,
  "district": "Беговой",
  "street": "Ленинградский проспект",
  "house_number": "вл8",
  "underground": "Белорусская",
  "residential_complex": "Slava"
}
```

### Initialization

Parameters accepted when creating a parser via CianParser:
* __location__ - the listing location, e.g. _Москва_ (use _cianparser.list_locations()_ to see the available values)
* __proxies__ - proxies (see the __Cloudflare, CloudScraper, Proxy__ section), default _None_

### The get_flats method

This method accepts the following arguments:
* __deal_type__ - deal type: long-term rent or sale _("rent_long", "sale")_
* __rooms__ - number of rooms, e.g. _1, (1, 3, "studio"), "studio", "all"_; any by default _("all")_
* __with_saving_csv__ - whether to save the collected data to CSV in real time during collection, default _False_
* __with_extra_data__ - whether to collect extra data, at the cost of a several-fold longer run (see __Notes__ below), default _False_
* __additional_settings__ - extra search settings (see __Additional search settings__ below), default _None_

Example:

```python
import cianparser

moscow_parser = cianparser.CianParser(location="Москва")
data = moscow_parser.get_flats(deal_type="rent_long", rooms=(1, 2), additional_settings={"start_page":1, "end_page":1})
```

The parser terminates gracefully when it runs out of result pages.
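The `get_flats` method returns a list of plain dicts like the sample record above, so the results are easy to post-process with the standard library alone. A minimal sketch (the sample records below are invented; the field names follow this README):

```python
from statistics import mean

# Invented sample records, shaped like cianparser's get_flats() output
offers = [
    {"url": "https://www.cian.ru/sale/flat/1/", "price": 28_000_000, "total_meters": 40.0},
    {"url": "https://www.cian.ru/sale/flat/2/", "price": 42_000_000, "total_meters": 60.0},
]

# Average price across all collected offers
avg_price = mean(offer["price"] for offer in offers)

# Price per square metre for each offer
price_per_m2 = [offer["price"] / offer["total_meters"] for offer in offers]

print(f"average price: {avg_price:,.0f} rub")
```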
For details on this, see the __Limitations__ section.

### The get_suburban method (collecting listings for houses, land plots, townhouses, etc.)

This method accepts the following arguments:
* __suburban_type__ - building type: house/dacha, part of a house, land plot, townhouse _("house", "house-part", "land-plot", "townhouse")_
* __deal_type__ - deal type: long-term rent or sale _("rent_long", "sale")_
* __with_saving_csv__ - whether to save the collected data to CSV in real time during collection, default _False_
* __with_extra_data__ - whether to collect extra data, at the cost of a several-fold longer run, default _False_
* __additional_settings__ - extra search settings (see __Additional search settings__ below), default _None_

Example:

```python
import cianparser

moscow_parser = cianparser.CianParser(location="Москва")
data = moscow_parser.get_suburban(suburban_type="townhouse", deal_type="sale", additional_settings={"start_page":1, "end_page":1})
```

### The get_newobjects method (collecting data on new housing developments)

This method accepts the following argument:
* __with_saving_csv__ - whether to save the collected data to CSV in real time during collection, default _False_

Example:

```python
import cianparser

moscow_parser = cianparser.CianParser(location="Москва")
data = moscow_parser.get_newobjects()
```

### Additional search settings

Example:

```python
additional_settings = {
    "start_page": 1,
    "end_page": 10,
    "is_by_homeowner": True,
    "min_price": 1000000,
    "max_price": 10000000,
    "min_balconies": 1,
    "have_loggia": True,
    "min_house_year": 1990,
    "max_house_year": 2023,
    "min_floor": 3,
    "max_floor": 4,
    "min_total_floor": 5,
    "max_total_floor": 10,
    "house_material_type": 1,
    "metro": "Московский",
    "metro_station": "ВДНХ",
    "metro_foot_minute": 45,
    "flat_share": 2,
    "only_flat": True,
    "only_apartment": True,
    "sort_by": "price_from_min_to_max",
}
```

* __object_type__ - housing type (_"secondary"_ - resale market, _"new"_ - new construction)
* __start_page__ - page to start collecting from
* __end_page__ - page to stop collecting at
* __is_by_homeowner__ - only listings posted by the owner
* __min_price__ - minimum price (in rubles)
* __max_price__ - maximum price (in rubles)
* __min_balconies__ - minimum number of balconies
* __have_loggia__ - has a loggia
* __min_house_year__ - earliest building construction year
* __max_house_year__ - latest building construction year
* __min_floor__ - lowest floor
* __max_floor__ - highest floor
* __min_total_floor__ - minimum number of floors in the building
* __max_total_floor__ - maximum number of floors in the building
* __house_material_type__ - building material type (_see the possible values below_)
* __metro__ - metro system name (_see the possible values below_)
* __metro_station__ - metro station (available when __metro__ is set)
* __metro_foot_minute__ - maximum walking time to the metro, in minutes
* __flat_share__ - with or without ownership shares (1 - shares only, 2 - no shares)
* __only_flat__ - exclude serviced apartments
* __only_apartment__ - serviced apartments only
* __sort_by__ - listing sort order (_see the possible values below_)

#### Possible values of **house_material_type**

- _1_ - brick
- _2_ - monolithic
- _3_ - panel
- _4_ - block
- _5_ - wooden
- _6_ - Stalin-era
- _7_ - panel-frame
- _8_ - brick-monolithic

#### Possible values of **metro** and **metro_station**

These correspond to the keys and values of the dictionary returned by **_cianparser.list_metro_stations()_**

#### Possible values of **sort_by**

- "_price_from_min_to_max_" - by price, cheapest first
- "_price_from_max_to_min_" - by price, most expensive first
- "_total_meters_from_max_to_min_" - by total area, largest first
- "_creation_data_from_newer_to_older_" - by listing date, newest first
- "_creation_data_from_older_to_newer_" - by listing date, oldest first

### Fields collected from long-term rental listings

* __district__ - district
* __underground__ - metro station
* __street__ - street
* __house_number__ - house number
* __floor__ - floor
* __floors_count__ - total number of floors
* __total_meters__ - total area
* __living_meters__ - living area
* __kitchen_meters__ - kitchen area
* __rooms_count__ - number of rooms
* __year_construction__ - year the building was constructed
* __house_material_type__ - building material type (brick/monolithic/panel, etc.)
* __heating_type__ - heating type
* __price_per_month__ - monthly rent
* __commissions__ - commission charged on move-in
* __author__ - listing author
* __author_type__ - author type
* __phone__ - phone number given in the listing
* __url__ - listing URL

Possible values of __author_type__:

- __real_estate_agent__ - real estate agency
- __homeowner__ - owner
- __realtor__ - realtor
- __official_representative__ - official representative of the management company
- __representative_developer__ - developer's representative
- __developer__ - developer
- __unknown__ - no type specified

### Fields collected from sale listings

The fields are __the same__ as those above, except that __price_per_month__ and __commissions__ are absent.
Several new fields appear instead:
* __price__ - property price
* __residential_complex__ - residential complex name
* __object_type__ - housing type (resale/new construction)
* __finish_type__ - finishing type

### Fields collected for new housing developments

* __name__ - residential complex name
* __url__ - page URL
* __full_location_address__ - full address of the complex
* __year_of_construction__ - completion year
* __house_material_type__ - building material type (_see the possible values above_)
* __finish_type__ - finishing type
* __ceiling_height__ - ceiling height
* __class__ - housing class
* __parking_type__ - parking type
* __floors_from__ - number of floors (from)
* __floors_to__ - number of floors (to)
* __builder__ - developer

### Saving data

The collected data can be saved in real time as it is gathered. To enable this, pass ___True___ in the __with_saving_csv__ argument.

#### Example file produced by __get_flats__ with __with_extra_data__ = __True__:

```bash
cian_flat_sale_1_1_moskva_12_Jan_2024_22_29_48_117413.csv
```

| author | author_type | url | location | deal_type | accommodation_type | floor | floors_count | rooms_count | total_meters | price_per_m2 | price | year_of_construction | object_type | house_material_type | heating_type | finish_type | living_meters | kitchen_meters | phone | district | street | house_number | underground | residential_complex |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| White and Broughton | real_estate_agent | https://www.cian.ru/sale/flat/290499455/ | Москва | sale | flat | 3 | 40 | 1 | 45.5 | 709890 | 32300000 | 2021 | Вторичка | Монолитный | Центральное | -1 | 19.0 | 6.0 | +79646331510 | Хорошевский | Ленинградский проспект | 37/4 | Динамо | Прайм Парк |
| ФСК | developer | https://www.cian.ru/sale/flat/288376323/ | Москва | sale | flat | 24 | 47 | 2 | 46.0 | 528900 | 24329400 | 2024 | Новостройка | Монолитно-кирпичный | -1 | Без отделки, предчистовая, чистовая | 18.0 | 15.0 | +74951387154 | Обручевский | Академика Волгина | 2С1 | Калужская | Архитектор |
| White and Broughton | real_estate_agent | https://www.cian.ru/sale/flat/292416804/ | Москва | sale | flat | 2 | 41 | 2 | 60.0 | 783333 | 47000000 | 2021 | Вторичка | -1 | Центральное | -1 | 43.0 | 5.0 | +79646331510 | Хорошевский | Ленинградский проспект | 37/5 | Динамо | Прайм Парк |

#### Example file produced by __get_suburban__ with __with_extra_data__ = __True__:

```bash
cian_suburban_townhouse_sale_15_15_moskva_13_Jan_2024_04_30_47_963046.csv
```

| author | author_type | url | location | deal_type | accommodation_type | price | year_of_construction | house_material_type | land_plot | land_plot_status | heating_type | gas_type | water_supply_type | sewage_system | bathroom | living_meters | floors_count | phone | district | underground | street | house_number |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| New Moscow House | real_estate_agent | https://www.cian.ru/sale/suburban/296304861/ | Москва | sale | suburban | 93000000 | 2020 | Кирпичный | 13 сот. | -1 | -1 | Есть | Есть | Есть | В доме | -1 | 2 | +79096865868 | Первомайское поселение |  | улица Центральная | 21 |
| LaRichesse | real_estate_agent | https://www.cian.ru/sale/suburban/290335502/ | Москва | sale | suburban | 95000000 | -1 | Пенобетонный блок | 12 сот. | Индивидуальное жилищное строительство | Центральное | -1 | -1 | -1 | -1 | 502,8 м² | 2 | +79652502027 | Воскресенское поселение |  | улица Каменка | 44Ас1 |
| Динара Ваганова | realtor | https://www.cian.ru/sale/suburban/293424451/ | Москва | sale | suburban | 21990000 | -1 | -1 | -1 | Индивидуальное жилищное строительство | -1 | Нет | -1 | Нет | -1 | -1 | -1 | +79672093870 | Первомайское поселение | м. Крёкшино |  |  |

#### Example file produced by __get_newobjects__:

```bash
cian_newobject_13_Jan_2024_01_27_32_734734.csv
```

| name | location | accommodation_type | url | full_location_address | year_of_construction | house_material_type | finish_type | ceiling_height | class | parking_type | floors_from | floors_to | builder |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ЖК «SYMPHONY 34 (Симфони 34)» | Москва | newobject | https://zhk-symphony-34-i.cian.ru | Москва, САО, Савеловский, 2-я Хуторская ул., 34 | 2025 | Монолитный | Предчистовая, чистовая | 3,0 м | Премиум | Подземная, гостевая | 36 | 54 | Застройщик MR Group |
| ЖК «Коллекция клубных особняков Ильинка 3/8» | Москва | newobject | https://zhk-kollekciya-klubnyh-osobnyakov-ilinka-38-i.cian.ru | Москва, ЦАО, Тверской, ул. Ильинка | 2024 | Монолитно-кирпичный, монолитный | Без отделки | от 3,35 м до 6,0 м | Премиум | Подземная, гостевая | 3 | 5 | Застройщик Sminex-Интеко |
| ЖК «Victory Park Residences (Виктори Парк Резиденсез)» | Москва | newobject | https://zhk-victory-park-residences-i.cian.ru | Москва, ЗАО, Дорогомилово, ул. Братьев Фонченко | 2024 | Монолитный | Чистовая | — | Премиум | Подземная | 10 | 11 | Застройщик ANT Development |

### Cloudflare, CloudScraper, Proxy

To get around blocking, the project uses **CloudScraper** (the **cloudscraper** library), which successfully bypasses **Cloudflare** protection.
Even so, this does not guarantee that _some users_ will never face a **CAPTCHA** challenge during long, uninterrupted use.

#### Proxy

For this reason you can supply proxies via the **proxies** argument (_a list of HTTPS proxies_).

Example:

```python
proxies = [
    '117.250.3.58:8080',
    '115.96.208.124:8080',
    '152.67.0.109:80',
    '45.87.68.2:15321',
    '68.178.170.59:80',
    '20.235.104.105:3729',
    '195.201.34.206:80',
]
```

On startup the tool iterates over all of them, looking for a suitable one: a proxy that, first, can actually make requests and, second, does not hit a **_CAPTCHA_** challenge.

Example log showing all three possible cases:

```
The process of checking the proxies... Search an available one among them...
 1 | proxy 46.47.197.210:3128: unavailable.. trying another
 2 | proxy 213.184.153.66:8080: there is captcha.. trying another
 3 | proxy 95.66.138.21:8880: available.. stop searching
```

### Limitations

The site returns listing pages __only up to page 54 inclusive__, which is roughly _28 * 54 = 1512_ listings. So if you want to collect as much data as possible, use more specific queries, e.g. split by number of rooms: instead of passing _rooms=(1, 2)_, run two separate collections with _rooms=1_ and _rooms=2_. At most this gives a factor of 6 (studio plus 1-, 2-, 3-, 4- and 5-room flats), i.e. 1512 versus 9072 listings.

### Notes

1. Some listings are missing data for some fields (_construction year, living area, kitchen area, etc._). In that case ___-1___ or an ___empty string___ is recorded for numeric and string fields respectively.
2. To avoid blocking by __IP__, the project pauses (___for 4-5 seconds___) after collecting each individual page.
3. Running several collection processes in parallel on the same machine is not recommended (see note 2).
4. The __with_extra_data__ flag collects some extra data, but slows the run down considerably (___5-10x___), because every individual listing page has to be visited. The extra fields are: ___kitchen area, building construction year, building material type, finishing type, heating type, housing type___ and ___phone number___.
5. This parser will not work in tools such as [Google Colaboratory](https://colab.research.google.com/). See [details](https://github.com/lenarsaitov/cianparser/issues/1).
6. If your location is not supported (an unexpected value of the __location__ argument, i.e. it is missing from **_cianparser.list_locations()_**), please report it; I will be glad to add it.

================================================
FILE: cianparser/__init__.py
================================================
from .cianparser import CianParser, list_locations, list_metro_stations

__author__ = "lenarsaitov"
__mail__ = "lenarsaitov1@yandex.ru"

================================================
FILE: cianparser/base_list.py
================================================
import math
import csv

from cianparser.constants import SPECIFIC_FIELDS_FOR_RENT_LONG, SPECIFIC_FIELDS_FOR_RENT_SHORT, SPECIFIC_FIELDS_FOR_SALE


class BaseListPageParser:
    def __init__(self, session, accommodation_type: str, deal_type: str, rent_period_type, location_name: str,
                 with_saving_csv=False, with_extra_data=False, object_type=None, additional_settings=None):
        self.accommodation_type = accommodation_type
        self.session = session
        self.deal_type = deal_type
        self.rent_period_type = rent_period_type
        self.location_name = location_name
        self.with_saving_csv = with_saving_csv
        self.with_extra_data = with_extra_data
        self.additional_settings = additional_settings
        self.object_type = object_type

        self.result = []
self.result_set = set() self.average_price = 0 self.count_parsed_offers = 0 self.start_page = 1 if (additional_settings is None or "start_page" not in additional_settings.keys()) else additional_settings["start_page"] self.end_page = 100 if (additional_settings is None or "end_page" not in additional_settings.keys()) else additional_settings["end_page"] self.file_path = self.build_file_path() def is_sale(self): return self.deal_type == "sale" def is_rent_long(self): return self.deal_type == "rent" and self.rent_period_type == 4 def is_rent_short(self): return self.deal_type == "rent" and self.rent_period_type == 2 def build_file_path(self): pass def define_average_price(self, price_data): if "price" in price_data: self.average_price = (self.average_price * self.count_parsed_offers + price_data["price"]) / self.count_parsed_offers elif "price_per_month" in price_data: self.average_price = (self.average_price * self.count_parsed_offers + price_data["price_per_month"]) / self.count_parsed_offers def print_parse_progress(self, page_number, count_of_pages, offers, ind): total_planed_offers = len(offers) * count_of_pages print(f"\r {page_number - self.start_page + 1}" f" | {page_number} page with list: [" + "=>" * (ind + 1) + " " * (len(offers) - ind - 1) + "]" + f" {math.ceil((ind + 1) * 100 / len(offers))}" + "%" + f" | Count of all parsed: {self.count_parsed_offers}." f" Progress ratio: {math.ceil(self.count_parsed_offers * 100 / total_planed_offers)} %." 
f" Average price: {'{:,}'.format(int(self.average_price)).replace(',', ' ')} rub", end="\r", flush=True) def remove_unnecessary_fields(self): if self.is_sale(): for not_need_field in SPECIFIC_FIELDS_FOR_RENT_LONG: if not_need_field in self.result[-1]: del self.result[-1][not_need_field] for not_need_field in SPECIFIC_FIELDS_FOR_RENT_SHORT: if not_need_field in self.result[-1]: del self.result[-1][not_need_field] if self.is_rent_long(): for not_need_field in SPECIFIC_FIELDS_FOR_RENT_SHORT: if not_need_field in self.result[-1]: del self.result[-1][not_need_field] for not_need_field in SPECIFIC_FIELDS_FOR_SALE: if not_need_field in self.result[-1]: del self.result[-1][not_need_field] if self.is_rent_short(): for not_need_field in SPECIFIC_FIELDS_FOR_RENT_LONG: if not_need_field in self.result[-1]: del self.result[-1][not_need_field] for not_need_field in SPECIFIC_FIELDS_FOR_SALE: if not_need_field in self.result[-1]: del self.result[-1][not_need_field] return self.result def save_results(self): self.remove_unnecessary_fields() keys = self.result[0].keys() with open(self.file_path, 'w', newline='', encoding='utf-8') as output_file: dict_writer = csv.DictWriter(output_file, keys, delimiter=';') dict_writer.writeheader() dict_writer.writerows(self.result) ================================================ FILE: cianparser/cianparser.py ================================================ import cloudscraper import time from cianparser.constants import CITIES, METRO_STATIONS, DEAL_TYPES, OBJECT_SUBURBAN_TYPES from cianparser.url_builder import URLBuilder from cianparser.proxy_pool import ProxyPool from cianparser.flat.list import FlatListPageParser from cianparser.suburban.list import SuburbanListPageParser from cianparser.newobject.list import NewObjectListParser def list_locations(): return CITIES def list_metro_stations(): return METRO_STATIONS class CianParser: def __init__(self, location: str, proxies=None): """ Initialize the Cian website parser Examples: >>> 
moscow_parser = cianparser.CianParser(location="Москва") :param str location: location. e.g. "Москва", for see all correct values use cianparser.list_locations() :param proxies: proxies for executing requests (https scheme), default None """ location_id = __validation_init__(location) self.__parser__ = None self.__session__ = cloudscraper.create_scraper() self.__session__.headers = {'Accept-Language': 'en'} self.__proxy_pool__ = ProxyPool(proxies=proxies) self.__location_name__ = location self.__location_id__ = location_id def __set_proxy__(self, url_list): if self.__proxy_pool__.is_empty(): return available_proxy = self.__proxy_pool__.get_available_proxy(url_list) if available_proxy is not None: self.__session__.proxies = {"https": available_proxy} def __load_list_page__(self, url_list_format, page_number, attempt_number_exception): url_list = url_list_format.format(page_number) self.__set_proxy__(url_list) if page_number == self.__parser__.start_page and attempt_number_exception == 0: print(f"The page from which the collection of information begins: \n {url_list}") res = self.__session__.get(url=url_list) if res.status_code == 429: time.sleep(10) res.raise_for_status() return res.text def __run__(self, url_list_format: str): print(f"\n{' ' * 30}Preparing to collect information from pages..") if self.__parser__.with_saving_csv: print(f"The absolute path to the file: \n{self.__parser__.file_path} \n") page_number = self.__parser__.start_page - 1 end_all_parsing = False while page_number < self.__parser__.end_page and not end_all_parsing: page_parsed = False page_number += 1 attempt_number_exception = 0 while attempt_number_exception < 3 and not page_parsed: try: (page_parsed, attempt_number, end_all_parsing) = self.__parser__.parse_list_offers_page( html=self.__load_list_page__(url_list_format=url_list_format, page_number=page_number, attempt_number_exception=attempt_number_exception), page_number=page_number, count_of_pages=self.__parser__.end_page + 1 - 
self.__parser__.start_page, attempt_number=attempt_number_exception) except Exception as e: attempt_number_exception += 1 if attempt_number_exception < 3: continue print(f"\n\nException: {e}") print(f"The collection of information from the pages with ending parse on {page_number} page...\n") break print(f"\n\nThe collection of information from the pages with list of offers is completed") print(f"Total number of parsed offers: {self.__parser__.count_parsed_offers}. ", end="\n") def get_flats(self, deal_type: str, rooms, with_saving_csv=False, with_extra_data=False, additional_settings=None): """ Parse information of flats from cian website Examples: >>> moscow_parser = cianparser.CianParser(location="Москва") >>> data = moscow_parser.get_flats(deal_type="rent_long", rooms=1) >>> data = moscow_parser.get_flats(deal_type="rent_short", rooms=(1,3,"studio"), with_saving_csv=True) >>> data = moscow_parser.get_flats(deal_type="sale", additional_settings={"start_page": 1, "end_page": 1, "sort_by":"price_from_min_to_max"}) :param deal_type: type of deal, e.g. "rent_long", "rent_short", "sale" :param rooms: how many rooms in accommodation, default "all". 
Example 1, (1,3, "studio"), "studio, "all" :param with_saving_csv: is it necessary to save data in csv, default False :param with_extra_data: is it necessary to collect additional data (but with increasing time duration), default False :param additional_settings: additional settings such as min_price, sort_by and others, default None """ __validation_get_flats__(deal_type, rooms) deal_type, rent_period_type = __define_deal_type__(deal_type) self.__parser__ = FlatListPageParser( session=self.__session__, accommodation_type="flat", deal_type=deal_type, rent_period_type=rent_period_type, location_name=self.__location_name__, with_saving_csv=with_saving_csv, with_extra_data=with_extra_data, additional_settings=additional_settings, ) self.__run__( __build_url_list__(location_id=self.__location_id__, deal_type=deal_type, accommodation_type="flat", rooms=rooms, rent_period_type=rent_period_type, additional_settings=additional_settings)) return self.__parser__.result def get_suburban(self, suburban_type: str, deal_type: str, with_saving_csv=False, with_extra_data=False, additional_settings=None): """ Parse information of suburbans from cian website Examples: >>> moscow_parser = cianparser.CianParser(location="Москва") >>> data = moscow_parser.get_suburbans(suburban_type="house",deal_type="rent_long") >>> data = moscow_parser.get_suburbans(suburban_type="house",deal_type="rent_short", with_saving_csv=True) >>> data = moscow_parser.get_suburbans(suburban_type="townhouse",deal_type="sale", additional_settings={"start_page": 1, "end_page": 1, "sort_by":"price_from_min_to_max"}) :param suburban_type: type of suburban building, e.g. "house", "house-part", "land-plot", "townhouse" :param deal_type: type of deal, e.g. 
"rent_long", "rent_short", "sale" :param with_saving_csv: is it necessary to save data in csv, default False :param with_extra_data: is it necessary to collect additional data (but with increasing time duration), default False :param additional_settings: additional settings such as min_price, sort_by and others, default None """ __validation_get_suburban__(suburban_type=suburban_type, deal_type=deal_type) deal_type, rent_period_type = __define_deal_type__(deal_type) self.__parser__ = SuburbanListPageParser( session=self.__session__, accommodation_type="suburban", deal_type=deal_type, rent_period_type=rent_period_type, location_name=self.__location_name__, with_saving_csv=with_saving_csv, with_extra_data=with_extra_data, additional_settings=additional_settings, object_type=suburban_type, ) self.__run__( __build_url_list__(location_id=self.__location_id__, deal_type=deal_type, accommodation_type="suburban", rooms=None, rent_period_type=rent_period_type, suburban_type=suburban_type, additional_settings=additional_settings)) return self.__parser__.result def get_newobjects(self, with_saving_csv=False): """ Parse information of newobjects from cian website Examples: >>> moscow_parser = cianparser.CianParser(location="Москва") >>> data = moscow_parser.get_newobjects(with_saving_csv=True) :param with_saving_csv: is it necessary to save data in csv, default False """ self.__parser__ = NewObjectListParser( session=self.__session__, location_name=self.__location_name__, with_saving_csv=with_saving_csv, ) self.__run__( __build_url_list__(location_id=self.__location_id__, deal_type="sale", accommodation_type="newobject")) return self.__parser__.result def __validation_init__(location): location_id = None for location_info in list_locations(): if location_info[0] == location: location_id = location_info[1] if location_id is None: ValueError(f'You entered {location}, which is not exists in base.' 
f' See all available values of location in cianparser.list_locations()') return location_id def __validation_get_flats__(deal_type, rooms): if deal_type not in DEAL_TYPES: raise ValueError(f'You entered deal_type={deal_type}, which is not valid value. ' f'Try entering one of these values: "rent_long", "sale".') if type(rooms) is tuple: for count_of_room in rooms: if type(count_of_room) is int: if count_of_room < 1 or count_of_room > 5: raise ValueError(f'You entered {count_of_room} in {rooms}, which is not valid value. ' f'Try entering one of these values: 1, 2, 3, 4, 5, "studio", "all".') elif type(count_of_room) is str: if count_of_room != "studio": raise ValueError(f'You entered {count_of_room} in {rooms}, which is not valid value. ' f'Try entering one of these values: 1, 2, 3, 4, 5, "studio", "all".') else: raise ValueError(f'In tuple "rooms" not valid type of element. ' f'It is correct int and str types. Example (1,3,5, "studio").') elif type(rooms) is int: if rooms < 1 or rooms > 5: raise ValueError(f'You entered rooms={rooms}, which is not valid value. ' f'Try entering one of these values: 1, 2, 3, 4, 5, "studio", "all".') elif type(rooms) is str: if rooms != "studio" and rooms != "all": raise ValueError(f'You entered rooms={rooms}, which is not valid value. ' f'Try entering one of these values: 1, 2, 3, 4, 5, "studio", "all".') else: raise ValueError(f'In argument "rooms" not valid type of element. ' f'It is correct int, str and tuple types. Example 1, (1,3, "studio"), "studio, "all".') def __validation_get_suburban__(suburban_type, deal_type): if suburban_type not in OBJECT_SUBURBAN_TYPES.keys(): raise ValueError(f'You entered suburban_type={suburban_type}, which is not valid value. ' f'Try entering one of these values: "house", "house-part", "land-plot", "townhouse".') if deal_type not in DEAL_TYPES: raise ValueError(f'You entered deal_type={deal_type}, which is not valid value. 
' f'Try entering one of these values: "rent_long", "sale".') def __build_url_list__(location_id, deal_type, accommodation_type, rooms=None, rent_period_type=None, suburban_type=None, additional_settings=None): url_builder = URLBuilder(accommodation_type == "newobject") url_builder.add_location(location_id) url_builder.add_deal_type(deal_type) url_builder.add_accommodation_type(accommodation_type) if rooms is not None: url_builder.add_room(rooms) if rent_period_type is not None: url_builder.add_rent_period_type(rent_period_type) if suburban_type is not None: url_builder.add_object_suburban_type(suburban_type) if additional_settings is not None: url_builder.add_additional_settings(additional_settings) return url_builder.get_url() def __define_deal_type__(deal_type): rent_period_type = None if deal_type == "rent_long": deal_type, rent_period_type = "rent", 4 elif deal_type == "rent_short": deal_type, rent_period_type = "rent", 2 return deal_type, rent_period_type ================================================ FILE: cianparser/constants.py ================================================ DEAL_TYPES = {"rent_long", "sale"} OBJECT_SUBURBAN_TYPES = {"house": "1", "house-part": "2", "land-plot": "3", "townhouse": "4"} OBJECT_TYPES = {"secondary": "1", "new": "2"} # DEAL_TYPES_NOT_IMPLEMENTED_YET = {"rent_short"} # ACCOMMODATION_TYPES_NOT_IMPLEMENTED_YET = {"room", "house", "house-part", "townhouse"} FLOATS_NUMBERS_REG_EXPRESSION = r"[+-]? *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?" FILE_NAME_FLAT_FORMAT = 'cian_{}_{}_{}_{}_{}_{}.csv' FILE_NAME_SUBURBAN_FORMAT = 'cian_{}_{}_{}_{}_{}_{}_{}.csv' FILE_NAME_NEWOBJECT_FORMAT = 'cian_{}_{}_{}.csv' BASE_URL = "https://cian.ru" DEFAULT_POSTFIX_PATH = "/cat.php?" NEWOBJECT_POSTFIX_PATH = "/newobjects/list/?" 
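As an illustrative aside (not part of the package), the deal-type mapping implemented by `__define_deal_type__` above can be exercised standalone: the public values `"rent_long"` and `"rent_short"` both become Cian's `"rent"` deal type, distinguished by the numeric rent-period code (4 for long term, 2 for short term) that is later handed to the URL builder. The function name here is a local copy for demonstration only.

```python
# Standalone sketch mirroring __define_deal_type__ (local copy, for illustration).
def define_deal_type(deal_type):
    rent_period_type = None
    if deal_type == "rent_long":
        deal_type, rent_period_type = "rent", 4   # long-term rent
    elif deal_type == "rent_short":
        deal_type, rent_period_type = "rent", 2   # short-term rent
    return deal_type, rent_period_type

print(define_deal_type("rent_long"))   # -> ('rent', 4)
print(define_deal_type("sale"))        # -> ('sale', None)
```

Note that `"sale"` passes through unchanged with no rent-period code, which is why `__build_url_list__` only appends the rent-period parameter when `rent_period_type is not None`.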
DEFAULT_PATH = "engine_version=2&p={}&with_neighbors=0" REGION_PATH = "®ion={}" OFFER_TYPE_PATH = "&offer_type={}" RENT_PERIOD_TYPE_PATH = "&type={}" DEAL_TYPE_PATH = "&deal_type={}" OBJECT_TYPE_PATH = "&object_type%5B0%5D={}" ROOM_PATH = "&room{}=1" STUDIO_PATH = "&room9=1" IS_ONLY_HOMEOWNER_PATH = "&is_by_homeowner=1" MIN_BALCONIES_PATH = "&min_balconies={}" HAVE_LOGGIA_PATH = "&loggia=1" MIN_HOUSE_YEAR_PATH = "&min_house_year={}" MAX_HOUSE_YEAR_PATH = "&max_house_year={}" MIN_PRICE_PATH = "&minprice={}" MAX_PRICE_PATH = "&maxprice={}" MIN_FLOOR_PATH = "&minfloor={}" MAX_FLOOR_PATH = "&maxfloor={}" MIN_TOTAL_FLOOR_PATH = "&minfloorn={}" MAX_TOTAL_FLOOR_PATH = "&maxfloorn={}" HOUSE_MATERIAL_TYPE_PATH = "&house_material%5B0%5D={}" METRO_FOOT_MINUTE_PATH = "&only_foot=2&foot_min={}" METRO_ID_PATH = "&metro%5B0%5D={}" FLAT_SHARE_PATH = "&flat_share={}" ONLY_FLAT_PATH = "&only_flat={}" APARTMENT_PATH = "&apartment={}" SORT_BY_PRICE_FROM_MIN_TO_MAX_PATH = "&sort=price_object_order" SORT_BY_PRICE_FROM_MAX_TO_MIN_PATH = "&sort=total_price_desc" SORT_BY_TOTAL_METERS_FROM_MAX_TO_MIN_PATH = "&sort=area_order" SORT_BY_CREATION_DATA_FROM_NEWER_TO_OLDER_PATH = "&sort=creation_date_desc" SORT_BY_CREATION_DATA_FROM_OLDER_TO_NEWER_PATH = "&sort=creation_date_asc" IS_SORT_BY_PRICE_FROM_MIN_TO_MAX_PATH = "price_from_min_to_max" IS_SORT_BY_PRICE_FROM_MAX_TO_MIN_PATH = "price_from_max_to_min" IS_SORT_BY_TOTAL_METERS_FROM_MAX_TO_MIN_PATH = "total_meters_from_max_to_min" IS_SORT_BY_CREATION_DATA_FROM_NEWER_TO_OLDER_PATH = "creation_data_from_newer_to_older" IS_SORT_BY_CREATION_DATA_FROM_OLDER_TO_NEWER_PATH = "creation_data_from_older_to_newer" NOT_STREET_ADDRESS_ELEMENTS = {"ЖК", "м.", "мкр.", "Жилой комплекс", "Жилой Комплекс"} STREET_TYPES = {"ул.", "улица", "аллея", "бульвар", "линия", "набережная", "тракт", "тупик", "шоссе", "переулок", "проспект", "проезд", "раздъезд", "мост", "авеню"} SPECIFIC_FIELDS_FOR_RENT_LONG = {"price_per_month", "commissions"} 
SPECIFIC_FIELDS_FOR_RENT_SHORT = {"price_per_day"} SPECIFIC_FIELDS_FOR_SALE = {"price", "residential_complex", "object_type", "finish_type"} CITIES = [ ['Москва', '1'], ['Санкт-Петербург', '2'], ['Абакан', '4638'], ['Анадырь', '4648'], ['Архангельск', '4658'], ['Астрахань', '4660'], ['Барнаул', '4668'], ['Белгород', '4671'], ['Биробиджан', '4682'], ['Благовещенск', '4683'], ['Бронницы', '4690'], ['Брянск', '4691'], ['Великий Новгород', '4694'], ['Владивосток', '4701'], ['Владикавказ', '4702'], ['Владимир', '4703'], ['Волгоград', '4704'], ['Вологда', '4708'], ['Воронеж', '4713'], ['Геленджик', '4717'], ['Горно-Алтайск', '4719'], ['Грозный', '4723'], ['Дзержинский', '4734'], ['Долгопрудный', '4738'], ['Дубна', '4741'], ['Екатеринбург', '4743'], ['Жуковский', '4750'], ['Звенигород', '4756'], ['Иванов', '4767'], ['Ижевск', '4770'], ['Иркутск', '4774'], ['Йошкар-Ола', '4776'], ['Казань', '4777'], ['Калининград', '4778'], ['Калуга', '4780'], ['Кемерово', '4795'], ['Киров', '4800'], ['Коломна', '4809'], ['Королёв', '4813'], ['Красноармейск', '4817'], ['Краснодар', '4820'], ['Краснознаменск', '4822'], ['Красноярск', '4827'], ['Курган', '4834'], ['Курск', '4835'], ['Кызыл', '4837'], ['Липецк', '4847'], ['Лобня', '4848'], ['Лыткарино', '4851'], ['Магадан', '4852'], ['Майкоп', '4855'], ['Махачкала', '4857'], ['Мурманск', '4871'], ['Нальчик', '4875'], ['Нарьян-Мар', '4876'], ['Нижний Новгород', '4885'], ['Новороссийск', '4896'], ['Новокузнецк', '4894'], ['Новосибирск', '4897'], ['Омск', '4914'], ['Оренбург', '4915'], ['Орехово-Зуево', '4916'], ['Пенза', '4923'], ['Пермь', '4927'], ['Петрозаводск', '4930'], ['Петропавловск-Камчатский', '4931'], ['Подольск', '4935'], ['Протвино', '4945'], ['Псков', '4946'], ['Пущино', '4949'], ['Реутов', '4958'], ['Ростов-на-Дону', '4959'], ['Рошаль', '4960'], ['Рязань', '4963'], ['Салехард', '4965'], ['Самара', '4966'], ['Саранск', '4967'], ['Саратов', '4969'], ['Серпухов', '4983'], ['Смоленск', '4987'], ['Сочи', '4998'], ['Ставрополь', 
'5001'], ['Сургут', '5003'], ['Сыктывкар', '5006'], ['Тамбов', '5011'], ['Тольятти', '5015'], ['Томск', '5016'], ['Тула', '5020'], ['Тюмень', '5024'], ['Улан-Удэ', '5026'], ['Ульяновск', '5027'], ['Фрязино', '5038'], ['Хабаровск', '5039'], ['Ханты-Мансийск', '5041'], ['Химки', '5044'], ['Чебоксары', '5047'], ['Челябинск', '5048'], ['Череповец', '5050'], ['Черкесск', '5051'], ['Чита', '5053'], ['Электросталь', '5064'], ['Элиста', '5065'], ['Южно-Сахалинск', '5069'], ['Якутск', '5073'], ['Ярославль', '5075'], ] OTHER_CITIES = [ ['Азов', '174136'], ['Аксай', '174151'], ['Альметьевск', '174184'], ['Анапа', '174191'], ['Балашиха', '174292'], ['Бокситогорск', '174373'], ['Бора', '174402'], ['Видное', '174508'], ['Волоколамск', '174522'], ['Воскресенск', '174530'], ['Высоковск', '174541'], ['Голицын', '174573'], ['Дмитров', '174634'], ['Домодедово', '174640'], ['Дрезна', '174644'], ['Егорьевск', '174659'], ['Истра', '174832'], ['Кашира', '174957'], ['Клин', '175004'], ['Кострома', '175050'], ['Котельник', '175051'], ['Красногорск', '175071'], ['Краснозаводск', '175075'], ['Кубинка', '175104'], ['Ликино-Дулёво', '175209'], ['Лосино-Петровский', '175219'], ['Луховицы', '175226'], ['Люберцы', '175231'], ['Можайск', '175349'], ['Мытищи', '175378'], ['Набережные Челны', '175380'], ['Назрань', '175389'], ['Одинцово', '175578'], ['Орёл', '175604'], ['Павловский Посад', '175635'], ['Пушкин', '175744'], ['Раменское', '175758'], ['Руза', '175785'], ['Сергиевом Посад', '175864'], ['Солнечногорск', '175903'], ['Ступино', '175996'], ['Талдом', '176052'], ['Тверь', '176083'], ['Уфа', '176245'], ['Хотьково', '176281'], ['Черноголовка', '176316'], ['Чехов', '176321'], ['Шатура', '176366'], ['Щёлково', '176401'], ['Электрогорск', '176405'], ['Яхрома', '176463'], ] CITIES.extend(OTHER_CITIES) METRO_STATIONS = { "Московский": [ ['Авиамоторная', '1'], ['Автозаводская', '2'], ['Академическая', '3'], ['Александровский сад', '4'], ['Алексеевская', '5'], ['Алтуфьево', '6'], ['Аннино', '7'], 
['Арбатская', '8'], ['Аэропорт', '9'], ['Бабушкинская', '10'], ['Багратионовская', '11'], ['Баррикадная', '12'], ['Бауманская', '13'], ['Беговая', '14'], ['Белорусская', '15'], ['Беляево', '16'], ['Бибирево', '17'], ['Библиотека им. Ленина', '18'], ['Новоясеневская', '19'], ['Боровицкая', '20'], ['Ботанический сад', '21'], ['Братиславская', '22'], ['Бульвар Адмирала Ушакова', '23'], ['Бульвар Дмитрия Донского', '24'], ['Бунинская аллея', '25'], ['Варшавская', '26'], ['ВДНХ', '27'], ['Владыкино', '28'], ['Водный стадион', '29'], ['Войковская', '30'], ['Волгоградский проспект', '31'], ['Волжская', '32'], ['Воробьёвы горы', '33'], ['Выхино', '34'], ['Выставочная', '35'], ['Динамо', '36'], ['Дмитровская', '37'], ['Добрынинская', '38'], ['Домодедовская', '39'], ['Дубровка', '40'], ['Измайловская', '41'], ['Калужская', '42'], ['Кантемировская', '43'], ['Каховская', '44'], ['Каширская', '45'], ['Киевская', '46'], ['Китай-город', '47'], ['Кожуховская', '48'], ['Коломенская', '49'], ['Комсомольская', '50'], ['Коньково', '51'], ['Красногвардейская', '52'], ['Красносельская', '53'], ['Красные ворота', '54'], ['Крестьянская застава', '55'], ['Кропоткинская', '56'], ['Крылатское', '57'], ['Кузнецкий мост', '58'], ['Кузьминки', '59'], ['Кунцевская', '60'], ['Курская', '61'], ['Кутузовская', '62'], ['Ленинский проспект', '63'], ['Лубянка', '64'], ['Люблино', '65'], ['Марксистская', '66'], ['Марьино', '67'], ['Маяковская', '68'], ['Медведково', '69'], ['Международная', '70'], ['Менделеевская', '71'], ['Молодёжная', '72'], ['Нагатинская', '73'], ['Нагорная', '74'], ['Нахимовский проспект', '75'], ['Новогиреево', '76'], ['Новокузнецкая', '77'], ['Новослободская', '78'], ['Новые Черёмушки', '79'], ['Октябрьская', '80'], ['Октябрьское поле', '81'], ['Орехово', '82'], ['Отрадное', '83'], ['Охотный ряд', '84'], ['Павелецкая', '85'], ['Парк Культуры', '86'], ['Парк Победы', '87'], ['Партизанская', '88'], ['Первомайская', '89'], ['Перово', '90'], ['Петровско-Разумовская', '91'], 
['Печатники', '92'], ['Пионерская', '93'], ['Планерная', '94'], ['Площадь Ильича', '95'], ['Площадь Революции', '96'], ['Полежаевская', '97'], ['Полянка', '98'], ['Пражская', '99'], ['Преображенская площадь', '100'], ['Пролетарская', '101'], ['Проспект Вернадского', '102'], ['Проспект Мира', '103'], ['Профсоюзная', '104'], ['Пушкинская', '105'], ['Речной вокзал', '106'], ['Рижская', '107'], ['Римская', '108'], ['Рязанский проспект', '109'], ['Савёловская', '110'], ['Свиблово', '111'], ['Севастопольская', '112'], ['Семёновская', '113'], ['Серпуховская', '114'], ['Смоленская', '115'], ['Сокол', '116'], ['Сокольники', '117'], ['Спортивная', '118'], ['Сретенский бульвар', '119'], ['Студенческая', '120'], ['Сухаревская', '121'], ['Сходненская', '122'], ['Таганская', '123'], ['Тверская', '124'], ['Театральная', '125'], ['Текстильщики', '126'], ['Тёплый Стан', '127'], ['Тимирязевская', '128'], ['Третьяковская', '129'], ['Трубная', '130'], ['Тульская', '131'], ['Тургеневская', '132'], ['Тушинская', '133'], ['Улица 1905 года', '134'], ['Улица Академика Янгеля', '135'], ['Улица Горчакова', '136'], ['Бульвар Рокоссовского', '137'], ['Улица Скобелевская', '138'], ['Улица Старокачаловская', '139'], ['Университет', '140'], ['Филёвский парк', '141'], ['Фили', '142'], ['Фрунзенская', '143'], ['Царицыно', '144'], ['Цветной бульвар', '145'], ['Черкизовская', '146'], ['Чертановская', '147'], ['Чеховская', '148'], ['Чистые пруды', '149'], ['Чкаловская', '150'], ['Шаболовская', '151'], ['Шоссе Энтузиастов', '152'], ['Щёлковская', '153'], ['Щукинская', '154'], ['Электрозаводская', '155'], ['Юго-Западная', '156'], ['Южная', '157'], ['Ясенево', '158'], ['Краснопресненская', '159'], ['Строгино', '228'], ['Славянский бульвар', '229'], ['Мякинино', '233'], ['Волоколамская', '234'], ['Митино', '235'], ['Марьина Роща', '236'], ['Шипиловская', '238'], ['Зябликово', '239'], ['Борисово', '240'], ['Новокосино', '243'], ['Пятницкое шоссе', '244'], ['Алма-Атинская', '245'], ['Жулебино', '270'], 
['Лермонтовский проспект', '271'], ['Деловой центр', '272'], ['Лесопарковая', '273'], ['Битцевский парк', '274'], ['Спартак', '275'], ['Улица Сергея Эйзенштейна', '276'], ['Выставочный центр', '277'], ['Улица Академика Королёва', '278'], ['Телецентр', '279'], ['Улица Милашенкова', '280'], ['Тропарёво', '281'], ['Котельники', '282'], ['Технопарк', '283'], ['Румянцево', '284'], ['Саларьево', '285'], ['Фонвизинская', '286'], ['Бутырская', '287'], ['Хорошёво', '289'], ['Зорге', '290'], ['Панфиловская', '291'], ['Стрешнево', '292'], ['Балтийская', '293'], ['Коптево', '294'], ['Лихоборы', '295'], ['Окружная', '296'], ['Ростокино', '297'], ['Белокаменная', '298'], ['Локомотив', '299'], ['Измайлово', '300'], ['Соколиная гора', '301'], ['Андроновка', '302'], ['Нижегородская', '303'], ['Новохохловская', '304'], ['Угрешская', '305'], ['ЗИЛ', '306'], ['Верхние котлы', '307'], ['Крымская', '308'], ['Площадь Гагарина', '309'], ['Лужники', '310'], ['Шелепиха', '311'], ['Минская', '337'], ['Ломоносовский проспект', '338'], ['Раменки', '339'], ['Ховрино', '349'], ['Петровский Парк', '350'], ['Хорошёвская', '351'], ['ЦСКА', '352'], ['Верхние Лихоборы', '353'], ['Селигерская', '354'], ['Мичуринский проспект', '361'], ['Озёрная', '362'], ['Говорово', '363'], ['Солнцево', '364'], ['Боровское шоссе', '365'], ['Новопеределкино', '366'], ['Рассказовка', '367'], ['Беломорская', '369'], ['Косино', '370'], ['Улица Дмитриевского', '371'], ['Лухмановская', '372'], ['Некрасовка', '373'], ['Юго-Восточная', '374'], ['Окская', '375'], ['Стахановская', '376'], ['Филатов Луг', '377'], ['Прокшино', '378'], ['Ольховая', '379'], ['Коммунарка', '380'], ['Лефортово', '381'], ['Шереметьевская', '383'], ['Рижская', '384'], ['Сокольники', '385'], ['Электрозаводская', '386'], ['Кленовый бульвар', '387'], ['Нагатинский Затон', '388'], ['Зюзино', '389'], ['Воронцовская', '390'], ['Новаторская', '391'], ['Аминьевская', '392'], ['Давыдково', '393'], ['Кунцевская', '394'], ['Мнёвники', '395'], ['Терехово ', 
'396'], ['Карамышевская', '397'], ['Яхромская', '398'], ['Лианозово', '399'], ['Тестовская', '400'], ['Рабочий посёлок', '401'], ['Сетунь', '402'], ['Немчиновка', '403'], ['Сколково', '404'], ['Баковка', '405'], ['Одинцово', '406'], ['Лобня', '407'], ['Хлебниково', '408'], ['Водники', '409'], ['Долгопрудная', '410'], ['Новодачная', '411'], ['Марк', '412'], ['Бескудниково', '413'], ['Дегунино', '414'], ['Нахабино', '415'], ['Аникеевка', '416'], ['Опалиха', '417'], ['Красногорская', '418'], ['Павшино', '419'], ['Пенягино', '420'], ['Трикотажная', '421'], ['Стрешнево', '422'], ['Красный Балтиец', '423'], ['Гражданская', '424'], ['Москва-Товарная', '425'], ['Калитники', '426'], ['Люблино', '427'], ['Депо', '428'], ['Перерва', '429'], ['Москворечье', '430'], ['Покровское', '431'], ['Красный Строитель', '432'], ['Битца', '433'], ['Щербинка', '434'], ['Силикатная', '435'], ['Подольск', '436'], ['Бутово', '437'], ['Остафьево', '438'], ['Курьяново', '439'], ['Народное Ополчение', '440'], ['Площадь трёх вокзалов', '441'], ['Авиамоторная', '443'], ['Деловой центр', '444'], ['Каширская', '445'], ['Лефортово', '446'], ['Мичуринский проспект', '447'], ['Нижегородская', '448'], ['Печатники', '449'], ['Проспект Вернадского', '450'], ['Савёловская', '451'], ['Текстильщики', '452'], ['Шелепиха', '453'], ['Марьина Роща', '454'], ['Зеленоград — Крюково', '455'], ['Фирсановская', '456'], ['Сходня', '457'], ['Подрезково', '458'], ['Новоподрезково', '459'], ['Молжаниново', '460'], ['Химки', '461'], ['Левобережная', '462'], ['Ховрино', '463'], ['Грачёвская', '464'], ['Моссельмаш', '465'], ['Лихоборы', '466'], ['Петровско-Разумовская', '467'], ['Останкино', '468'], ['Электрозаводская', '470'], ['Сортировочная', '471'], ['Андроновка', '473'], ['Перово', '474'], ['Плющево', '475'], ['Вешняки', '476'], ['Выхино', '477'], ['Рязанский проспект', '478'], ['Ухтомская', '479'], ['Люберцы', '480'], ['Панки', '481'], ['Томилино', '482'], ['Красково', '483'], ['Котельники', '484'], ['Отдых', '488'], 
['Кратово', '489'], ['Есенинская', '490'], ['Фабричная', '491'], ['Раменское', '492'], ['Ипподром', '493'], ['Апрелевка', '494'], ['Победа', '495'], ['Крёкшино', '496'], ['Санино', '497'], ['Кокошкино', '498'], ['Толстопальцево', '499'], ['Лесной Городок', '500'], ['Внуково', '501'], ['Мичуринец', '502'], ['Переделкино', '503'], ['Солнечная', '504'], ['Говорово', '505'], ['Очаково', '506'], ['Аминьевская', '507'], ['Матвеевская', '508'], ['Минская', '509'], ['Кутузовская', '511'], ['Беговая', '513'], ['Белорусская', '514'], ['Рижская', '517'], ['Курская', '519'], ['Чухлинка', '522'], ['Кусково', '523'], ['Новогиреево', '524'], ['Реутов', '525'], ['Никольское', '526'], ['Салтыковская', '527'], ['Кучино', '528'], ['Ольгино', '529'], ['Железнодорожная', '530'], ['Физтех', '533'], ['Аэропорт Внуково', '535'], ['Пыхтино', '536'], ['Марьина Роща', '537'], ], "Казанский": [ ['Северный Вокзал', '314'], ['Яшьлек', '315'], ['Козья слобода', '316'], ['Кремлёвская', '317'], ['Площадь Тукая', '318'], ['Суконная слобода', '319'], ['Аметьево', '320'], ['Горки', '321'], ['Проспект Победы', '322'], ['Дубравная', '368'], ], "Петербургский": [ ['Девяткино', '167'], ['Гражданский проспект', '168'], ['Академическая', '169'], ['Политехническая', '170'], ['Площадь Мужества', '171'], ['Лесная', '172'], ['Выборгская', '173'], ['Площадь Ленина', '174'], ['Чернышевская', '175'], ['Площадь Восстания', '176'], ['Владимирская', '177'], ['Пушкинская', '178'], ['Технологический институт', '179'], ['Балтийская', '180'], ['Нарвская', '181'], ['Кировский завод', '182'], ['Автово', '183'], ['Ленинский проспект', '184'], ['Проспект Ветеранов', '185'], ['Парнас', '186'], ['Проспект Просвещения', '187'], ['Озерки', '188'], ['Удельная', '189'], ['Пионерская', '190'], ['Черная речка', '191'], ['Петроградская', '192'], ['Горьковская', '193'], ['Невский проспект', '194'], ['Сенная площадь', '195'], ['Фрунзенская', '197'], ['Московские ворота', '198'], ['Электросила', '199'], ['Парк Победы', '200'], 
['Московская', '201'], ['Звездная', '202'], ['Купчино', '203'], ['Приморская', '204'], ['Василеостровская', '205'], ['Гостиный двор', '206'], ['Маяковская', '207'], ['Площадь Александра Невского', '208'], ['Елизаровская', '210'], ['Ломоносовская', '211'], ['Пролетарская', '212'], ['Обухово', '213'], ['Рыбацкое', '214'], ['Комендантский проспект', '215'], ['Старая Деревня', '216'], ['Крестовский остров', '217'], ['Чкаловская', '218'], ['Спортивная', '219'], ['Садовая', '220'], ['Достоевская', '221'], ['Лиговский проспект', '222'], ['Новочеркасская', '224'], ['Ладожская', '225'], ['Проспект Большевиков', '226'], ['Улица Дыбенко', '227'], ['Волковская', '230'], ['Звенигородская', '231'], ['Спасская', '232'], ['Обводный канал', '241'], ['Адмиралтейская', '242'], ['Международная', '246'], ['Бухарестская', '247'], ['Проспект Славы', '357'], ['Беговая', '355'], ['Зенит', '356'], ['Проспект Славы', '357'], ['Дунайская', '358'], ['Шушары', '359'], ['Горный институт', '382'], ], "Самарский": [ ['Российская', '261'], ['Московская', '262'], ['Гагаринская', '263'], ['Спортивная', '264'], ['Советская', '265'], ['Победа', '266'], ['Безымянка', '267'], ['Кировская', '268'], ['Юнгородок', '269'], ['Победа', '270'], ['Алабинская', '312'], ], "Екатеринбургский": [ ['Проспект Космонавтов', '340'], ['Уралмаш', '341'], ['Машиностроителей', '342'], ['Уральская', '343'], ['Динамо', '343'], ['Площадь 1905 года', '345'], ['Геологическая', '346'], ['Чкаловская', '347'], ['Ботаническая', '348'], ], "Новосибирский": [ ['Заельцовская', '248'], ['Гагаринская', '249'], ['Красный Проспект', '250'], ['Сибирская', '251'], ['Площадь Ленина', '252'], ['Октябрьская', '253'], ['Речной Вокзал', '254'], ['Студенческая', '255'], ['Площадь Маркса', '256'], ['Площадь Гарина-Михайловского', '257'], ['Маршала Покрышкина', '258'], ['Березовая Роща', '259'], ['Золотая Нива', '260'], ], "Нижегородский": [ ['Горьковская', '323'], ['Московская', '324'], ['Чкаловская', '325'], ['Ленинская', '326'], ['Заречная', 
'327'], ['Двигатель Революции', '328'], ['Пролетарская', '329'], ['Автозаводская', '330'], ['Комсомольская', '331'], ['Кировская', '332'], ['Парк культуры', '333'], ['Канавинская', '334'], ['Бурнаковская', '335'], ['Буревестник', '335'], ['Стрелка', '360'] ], } ================================================ FILE: cianparser/definers/__init__.py ================================================ ================================================ FILE: cianparser/definers/definer_cities_id.py ================================================ import time import requests from bs4 import BeautifulSoup import pymorphy2 import collections import csv import cloudscraper ParseCityNames = collections.namedtuple( 'ParseResults', { 'location_name', 'city_id', } ) class Client: def __init__(self, start_location_id=1, end_location_id=20): self.session = cloudscraper.create_scraper() self.session.headers = {'Accept-Language': 'en'} self.cities = [] self.cities_set = set() self.start_location_id = start_location_id self.end_location_id = end_location_id def define_city(self, html, location_id: int): soup = BeautifulSoup(html, 'html.parser') offers = soup.select("div[data-name='HeaderDefault']") if len(offers) == 0: print("_" + " " + "***") return self.cities title = offers[0].text city = title.lower()[title.lower().find("снять квартиру в ") + len("снять квартиру в "):title.lower().find( " на длительный срок")] if ("в России" in title or "АрендаСнять" not in title or ("области" in city or "крае" in city or "республике" in city or "округе" in city or "россии" in city or "кабардино" in city or "карачаево" in city or "дагестан" in city or "осетии" in city or "ненецком ао" in city or "ямало-ненецком ао" in city or "чукотском ао" in city or "ханты-мансийском ао" in city or "чувашии" in city) ): print("_" + " " + str(location_id)) return self.cities morph = pymorphy2.MorphAnalyzer() city = morph.parse(city)[0].normal_form.title() print(city + " " + str(location_id)) if city not in 
self.cities_set: self.cities_set.add(city) self.cities.append((city, location_id)) self.save_results() return self.cities def define_all_cities(self): for location_id in range(self.start_location_id, self.end_location_id+1): path = f'https://www.cian.ru/cat.php?deal_type=rent&engine_version=2&offer_type=flat&p=1&region={location_id}&type=4' response = requests.get(path) html = response.text self.define_city(html, location_id) time.sleep(2) self.cities = sorted(self.cities, key=lambda x: x[0]) def save_results(self): cities_result = [] cities_result.append(ParseCityNames( location_name='location_name', city_id='city_id', )) for city_couple in self.cities: cities_result.append(ParseCityNames( location_name=city_couple[0], city_id=city_couple[1], )) path = f"cities_{self.start_location_id}_{self.end_location_id}.csv" with open(path, "w") as f: writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL) for item in self.cities: writer.writerow(item) if __name__ == '__main__': definer = Client(start_location_id=6000, end_location_id=7000) definer.define_all_cities() ================================================ FILE: cianparser/definers/definer_metro_id.py ================================================ import time import requests from bs4 import BeautifulSoup import collections import csv import cloudscraper ParseMetroNames = collections.namedtuple( 'ParseResults', { 'city', 'metro_name', 'metro_id', } ) class Client: def __init__(self, start_metro_id=1, end_metro_id=20): self.session = cloudscraper.create_scraper() self.session.headers = {'Accept-Language': 'en'} self.metro_stations = [] self.metro_set = set() self.start_metro_id = start_metro_id self.end_metro_id = end_metro_id def define_metro(self, html, metro_id: int): soup = BeautifulSoup(html, 'html.parser') offers = soup.select("div[data-name='GeneralInfoSectionRowComponent']") if len(offers) == 0: print("_" + " " + "***") return self.metro_stations address = offers[1].text if ", м."
not in address: for offer in offers: if ", м." in offer.text: address = offer.text if address.find(", м.") == 0: print("_" + " " + "***" + " something is wrong") city = "Unknown" if "Москва" in address: city = "Москва" if "Казань" in address: city = "Казань" if "Санкт-Петербург" in address: city = "Санкт-Петербург" if "Самара" in address: city = "Самара" if "Екатеринбург" in address: city = "Екатеринбург" if "Новосибирск" in address: city = "Новосибирск" if "Нижний Новгород" in address: city = "Нижний Новгород" metro = address[address.find(", м.") + len(", м. "):].split(", ")[0] print(f"{city}, {metro}, {str(metro_id)}") if metro not in self.metro_set: self.metro_set.add(metro) self.metro_stations.append((city, metro, metro_id)) self.save_results() return self.metro_stations def define_all_metro_stations(self): for metro_id in range(self.start_metro_id, self.end_metro_id+1): path = f'https://www.cian.ru/cat.php?deal_type=rent&engine_version=2&offer_type=flat&p=1&region=1&type=4&metro[0]={metro_id}' response = requests.get(path) html = response.text self.define_metro(html, metro_id) time.sleep(2) self.metro_stations = sorted(self.metro_stations, key=lambda x: x[0]) def save_results(self): metro_stations_result = [ParseMetroNames( city='city', metro_name='metro_name', metro_id='metro_id', )] for metro_couple in self.metro_stations: metro_stations_result.append(ParseMetroNames( city=metro_couple[0], metro_name=metro_couple[1], metro_id=metro_couple[2], )) path = f"metro_stations_{self.start_metro_id}_{self.end_metro_id}.csv" with open(path, "w") as f: writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL) for item in self.metro_stations: writer.writerow(item) if __name__ == '__main__': definer = Client(start_metro_id=1, end_metro_id=10) definer.define_all_metro_stations() ================================================ FILE: cianparser/flat/list.py ================================================ import bs4 import time import pathlib from datetime import datetime from
transliterate import translit from cianparser.constants import FILE_NAME_FLAT_FORMAT from cianparser.helpers import union_dicts, define_author, define_location_data, define_specification_data, define_deal_url_id, define_price_data from cianparser.flat.page import FlatPageParser from cianparser.base_list import BaseListPageParser class FlatListPageParser(BaseListPageParser): def build_file_path(self): now_time = datetime.now().strftime("%d_%b_%Y_%H_%M_%S_%f") file_name = FILE_NAME_FLAT_FORMAT.format(self.accommodation_type, self.deal_type, self.start_page, self.end_page, translit(self.location_name.lower(), reversed=True), now_time) return pathlib.Path(pathlib.Path.cwd(), file_name.replace("'", "")) def parse_list_offers_page(self, html, page_number: int, count_of_pages: int, attempt_number: int): list_soup = bs4.BeautifulSoup(html, 'html.parser') if list_soup.text.find("Captcha") > 0: print(f"\r{page_number} page: there is CAPTCHA... failed to parse page...") return False, attempt_number + 1, True header = list_soup.select("div[data-name='HeaderDefault']") if len(header) == 0: return False, attempt_number + 1, False offers = list_soup.select("article[data-name='CardComponent']") print("") print(f"\r {page_number} page: {len(offers)} offers", end="\r", flush=True) if page_number == self.start_page and attempt_number == 0: print(f"Collecting information from pages with list of offers", end="\n") for ind, offer in enumerate(offers): self.parse_offer(offer=offer) self.print_parse_progress(page_number=page_number, count_of_pages=count_of_pages, offers=offers, ind=ind) time.sleep(2) return True, 0, False def parse_offer(self, offer): common_data = dict() common_data["url"] = offer.select("div[data-name='LinkArea']")[0].select("a")[0].get('href') common_data["location"] = self.location_name common_data["deal_type"] = self.deal_type common_data["accommodation_type"] = self.accommodation_type author_data = define_author(block=offer) location_data = 
define_location_data(block=offer, is_sale=self.is_sale()) price_data = define_price_data(block=offer) specification_data = define_specification_data(block=offer) if define_deal_url_id(common_data["url"]) in self.result_set: return page_data = dict() if self.with_extra_data: flat_parser = FlatPageParser(session=self.session, url=common_data["url"]) page_data = flat_parser.parse_page() time.sleep(4) self.count_parsed_offers += 1 self.define_average_price(price_data=price_data) self.result_set.add(define_deal_url_id(common_data["url"])) self.result.append(union_dicts(author_data, common_data, specification_data, price_data, page_data, location_data)) if self.with_saving_csv: self.save_results() ================================================ FILE: cianparser/flat/page.py ================================================ import bs4 import re import time class FlatPageParser: def __init__(self, session, url): self.session = session self.url = url def __load_page__(self): res = self.session.get(self.url) if res.status_code == 429: time.sleep(10) res.raise_for_status() self.offer_page_html = res.text self.offer_page_soup = bs4.BeautifulSoup(self.offer_page_html, 'html.parser') def __parse_flat_offer_page_json__(self): page_data = { "year_of_construction": -1, "object_type": -1, "house_material_type": -1, "heating_type": -1, "finish_type": -1, "living_meters": -1, "kitchen_meters": -1, "floor": -1, "floors_count": -1, "phone": "", } spans = self.offer_page_soup.select("span") for index, span in enumerate(spans): if "Тип жилья" == span.text: page_data["object_type"] = spans[index + 1].text if "Тип дома" == span.text: page_data["house_material_type"] = spans[index + 1].text if "Отопление" == span.text: page_data["heating_type"] = spans[index + 1].text if "Отделка" == span.text: page_data["finish_type"] = spans[index + 1].text if "Площадь кухни" == span.text: page_data["kitchen_meters"] = spans[index + 1].text if "Жилая площадь" == span.text: page_data["living_meters"] = 
spans[index + 1].text
            if "Год постройки" in span.text:
                page_data["year_of_construction"] = spans[index + 1].text
            if "Год сдачи" in span.text:
                page_data["year_of_construction"] = spans[index + 1].text
            if "Этаж" == span.text:
                ints = re.findall(r'\d+', spans[index + 1].text)
                if len(ints) == 2:
                    page_data["floor"] = int(ints[0])
                    page_data["floors_count"] = int(ints[1])

        if "+7" in self.offer_page_html:
            page_data["phone"] = self.offer_page_html[self.offer_page_html.find("+7"): self.offer_page_html.find("+7") + 16].split('"')[0].replace(" ", "").replace("-", "")

        return page_data

    def parse_page(self):
        self.__load_page__()
        return self.__parse_flat_offer_page_json__()


================================================
FILE: cianparser/helpers.py
================================================
import re
import itertools

from cianparser.constants import STREET_TYPES, NOT_STREET_ADDRESS_ELEMENTS, FLOATS_NUMBERS_REG_EXPRESSION


def union_dicts(*dicts):
    return dict(itertools.chain.from_iterable(dct.items() for dct in dicts))


def define_rooms_count(description):
    if "1-комн" in description or "Студия" in description:
        rooms_count = 1
    elif "2-комн" in description:
        rooms_count = 2
    elif "3-комн" in description:
        rooms_count = 3
    elif "4-комн" in description:
        rooms_count = 4
    elif "5-комн" in description:
        rooms_count = 5
    else:
        rooms_count = -1
    return rooms_count


def define_deal_url_id(url: str):
    url_path_elements = url.split("/")
    if len(url_path_elements[-1]) > 3:
        return url_path_elements[-1]
    if len(url_path_elements[-2]) > 3:
        return url_path_elements[-2]
    return "-1"


def define_author(block):
    spans = block.select("div")[0].select("span")
    author_data = {
        "author": "",
        "author_type": "",
    }

    for index, span in enumerate(spans):
        # membership must be tested against the tag's text, not the Tag object itself
        if "Агентство недвижимости" in span.text:
            author_data["author"] = spans[index + 1].text.replace(",", ".").strip()
            author_data["author_type"] = "real_estate_agent"
            return author_data

    for index, span in enumerate(spans):
        if "Собственник" in span.text:
            author_data["author"] = spans[index + 1].text
            author_data["author_type"] = "homeowner"
            return author_data

    for index, span in enumerate(spans):
        if "Риелтор" in span.text:
            author_data["author"] = spans[index + 1].text
            author_data["author_type"] = "realtor"
            return author_data

    for index, span in enumerate(spans):
        if "Ук・оф.Представитель" in span.text:
            author_data["author"] = spans[index + 1].text
            author_data["author_type"] = "official_representative"
            return author_data

    for index, span in enumerate(spans):
        if "Представитель застройщика" in span.text:
            author_data["author"] = spans[index + 1].text
            author_data["author_type"] = "representative_developer"
            return author_data

    for index, span in enumerate(spans):
        if "Застройщик" in span.text:
            author_data["author"] = spans[index + 1].text
            author_data["author_type"] = "developer"
            return author_data

    for index, span in enumerate(spans):
        if "ID" in span.text:
            author_data["author"] = span.text
            author_data["author_type"] = "unknown"
            return author_data

    return author_data


def parse_location_data(block):
    general_info_sections = block.select_one("div[data-name='LinkArea']").select("div[data-name='GeneralInfoSectionRowComponent']")
    location_data = dict()
    location_data["district"] = ""
    location_data["underground"] = ""
    location_data["street"] = ""
    location_data["house_number"] = ""

    for section in general_info_sections:
        geo_labels = section.select("a[data-name='GeoLabel']")
        for index, label in enumerate(geo_labels):
            if "м. " in label.text:
                location_data["underground"] = label.text
            if "р-н" in label.text or "поселение" in label.text:
                location_data["district"] = label.text
            if any(street_type in label.text.lower() for street_type in STREET_TYPES):
                location_data["street"] = label.text
                if len(geo_labels) > index + 1 and any(chr.isdigit() for chr in geo_labels[index + 1].text):
                    location_data["house_number"] = geo_labels[index + 1].text

    return location_data


def define_location_data(block, is_sale):
    elements = block.select_one("div[data-name='LinkArea']").select("div[data-name='GeneralInfoSectionRowComponent']")

    location_data = dict()
    location_data["district"] = ""
    location_data["street"] = ""
    location_data["house_number"] = ""
    location_data["underground"] = ""
    if is_sale:
        location_data["residential_complex"] = ""

    for index, element in enumerate(elements):
        if ("ЖК" in element.text) and ("«" in element.text) and ("»" in element.text):
            location_data["residential_complex"] = element.text.split("«")[1].split("»")[0]

        if "р-н" in element.text and len(element.text) < 250:
            address_elements = element.text.split(",")
            if len(address_elements) < 2:
                continue

            if "ЖК" in address_elements[0] and "«" in address_elements[0] and "»" in address_elements[0]:
                location_data["residential_complex"] = address_elements[0].split("«")[1].split("»")[0]

            if ", м. " in element.text:
                location_data["underground"] = element.text.split(", м. ")[1]
                if "," in location_data["underground"]:
                    location_data["underground"] = location_data["underground"].split(",")[0]

            if (any(chr.isdigit() for chr in address_elements[-1]) and "жк" not in address_elements[-1].lower() and
                    not any(street_type in address_elements[-1].lower() for street_type in STREET_TYPES)) and len(address_elements[-1]) < 10:
                location_data["house_number"] = address_elements[-1].strip()

            for ind, elem in enumerate(address_elements):
                if "р-н" in elem:
                    district = elem.replace("р-н", "").strip()
                    location_data["district"] = district

                    if "ЖК" in address_elements[-1]:
                        location_data["residential_complex"] = address_elements[-1].strip()
                    if "ЖК" in address_elements[-2]:
                        location_data["residential_complex"] = address_elements[-2].strip()

                    for street_type in STREET_TYPES:
                        if street_type in address_elements[-1]:
                            location_data["street"] = address_elements[-1].strip()
                            if street_type == "улица":
                                location_data["street"] = location_data["street"].replace("улица", "")
                            return location_data
                        if street_type in address_elements[-2]:
                            location_data["street"] = address_elements[-2].strip()
                            if street_type == "улица":
                                location_data["street"] = location_data["street"].replace("улица", "")
                            return location_data

                    for k, after_district_address_element in enumerate(address_elements[ind + 1:]):
                        if len(list(set(after_district_address_element.split(" ")).intersection(NOT_STREET_ADDRESS_ELEMENTS))) != 0:
                            continue
                        if len(after_district_address_element.strip().replace(" ", "")) < 4:
                            continue
                        location_data["street"] = after_district_address_element.strip()
                        return location_data

            return location_data

    if location_data["district"] == "":
        for index, element in enumerate(elements):
            if ", м. " in element.text and len(element.text) < 250:
                location_data["underground"] = element.text.split(", м. ")[1]
                if "," in location_data["underground"]:
                    location_data["underground"] = location_data["underground"].split(",")[0]

                address_elements = element.text.split(",")
                if len(address_elements) < 2:
                    continue

                if "ЖК" in address_elements[-1]:
                    location_data["residential_complex"] = address_elements[-1].strip()
                if "ЖК" in address_elements[-2]:
                    location_data["residential_complex"] = address_elements[-2].strip()

                if (any(chr.isdigit() for chr in address_elements[-1]) and "жк" not in address_elements[-1].lower() and
                        not any(street_type in address_elements[-1].lower() for street_type in STREET_TYPES)) and len(address_elements[-1]) < 10:
                    location_data["house_number"] = address_elements[-1].strip()

                for street_type in STREET_TYPES:
                    if street_type in address_elements[-1]:
                        location_data["street"] = address_elements[-1].strip()
                        if street_type == "улица":
                            location_data["street"] = location_data["street"].replace("улица", "")
                        return location_data
                    if street_type in address_elements[-2]:
                        location_data["street"] = address_elements[-2].strip()
                        if street_type == "улица":
                            location_data["street"] = location_data["street"].replace("улица", "")
                        return location_data

            for street_type in STREET_TYPES:
                if (", " + street_type + " " in element.text) or (" " + street_type + ", " in element.text):
                    address_elements = element.text.split(",")
                    if len(address_elements) < 3:
                        continue

                    if (any(chr.isdigit() for chr in address_elements[-1]) and "жк" not in address_elements[-1].lower() and
                            not any(street_type in address_elements[-1].lower() for street_type in STREET_TYPES)) and len(address_elements[-1]) < 10:
                        location_data["house_number"] = address_elements[-1].strip()

                    if street_type in address_elements[-1]:
                        location_data["street"] = address_elements[-1].strip()
                        if street_type == "улица":
                            location_data["street"] = location_data["street"].replace("улица", "")
                        location_data["district"] = address_elements[-2].strip()
                        return location_data
                    if street_type in address_elements[-2]:
                        location_data["street"] = address_elements[-2].strip()
                        if street_type == "улица":
                            location_data["street"] = location_data["street"].replace("улица", "")
                        location_data["district"] = address_elements[-3].strip()
                        return location_data

    return location_data


def define_price_data(block):
    elements = block.select("div[data-name='LinkArea']")[0].select("span[data-mark='MainPrice']")

    price_data = {
        "price_per_month": -1,
        "commissions": 0,
    }

    for element in elements:
        if "₽/мес" in element.text:
            price_description = element.text
            price_data["price_per_month"] = int("".join(price_description[:price_description.find("₽/мес") - 1].split()))

            if "%" in price_description:
                price_data["commissions"] = int(price_description[price_description.find("%") - 2:price_description.find("%")].replace(" ", ""))
            return price_data

        if "₽" in element.text and "млн" not in element.text:
            price_description = element.text
            price_data["price"] = int("".join(price_description[:price_description.find("₽") - 1].split()))
            return price_data

    return price_data


def define_specification_data(block):
    specification_data = dict()
    specification_data["floor"] = -1
    specification_data["floors_count"] = -1
    specification_data["rooms_count"] = -1
    specification_data["total_meters"] = -1

    title = block.select("div[data-name='LinkArea']")[0].select("div[data-name='GeneralInfoSectionRowComponent']")[0].text
    common_properties = block.select("div[data-name='LinkArea']")[0].select("div[data-name='GeneralInfoSectionRowComponent']")[0].text

    # str.find returns -1 when the substring is absent, never None
    if common_properties.find("м²") != -1:
        total_meters = title[: common_properties.find("м²")].replace(",", ".")
        if len(re.findall(FLOATS_NUMBERS_REG_EXPRESSION, total_meters)) != 0:
            specification_data["total_meters"] = float(re.findall(FLOATS_NUMBERS_REG_EXPRESSION, total_meters)[-1].replace(" ", "").replace("-", ""))

    if "этаж" in common_properties:
        floor_per = common_properties[common_properties.rfind("этаж") - 7: common_properties.rfind("этаж")]
        floor_properties = floor_per.split("/")
        if len(floor_properties) == 2:
            ints = re.findall(r'\d+', floor_properties[0])
            if len(ints) != 0:
                specification_data["floor"] = int(ints[-1])
            ints = re.findall(r'\d+', floor_properties[1])
            if len(ints) != 0:
                specification_data["floors_count"] = int(ints[-1])

    specification_data["rooms_count"] = define_rooms_count(common_properties)

    return specification_data


================================================
FILE: cianparser/newobject/list.py
================================================
import bs4
import time
import math
import csv
import pathlib
from datetime import datetime
from transliterate import translit
import urllib.parse

from cianparser.constants import FILE_NAME_NEWOBJECT_FORMAT
from cianparser.helpers import union_dicts
from cianparser.newobject.page import NewObjectPageParser


class NewObjectListParser:
    def __init__(self, session, location_name: str, with_saving_csv=False):
        self.accommodation_type = "newobject"
        self.deal_type = "sale"
        self.session = session
        self.location_name = location_name
        self.with_saving_csv = with_saving_csv

        self.result = []
        self.result_set = set()
        self.average_price = 0
        self.count_parsed_offers = 0
        self.start_page = 1
        self.end_page = 50
        self.file_path = self.build_file_path()

    def build_file_path(self):
        now_time = datetime.now().strftime("%d_%b_%Y_%H_%M_%S_%f")
        file_name = FILE_NAME_NEWOBJECT_FORMAT.format(self.accommodation_type,
                                                      translit(self.location_name.lower(), reversed=True),
                                                      now_time)
        return pathlib.Path(pathlib.Path.cwd(), file_name.replace("'", ""))

    def print_parse_progress(self, page_number, count_of_pages, offers, ind):
        total_planed_offers = len(offers) * count_of_pages
        print(f"\r {page_number - self.start_page + 1}"
              f" | {page_number} page with list: [" + "=>" * (ind + 1) + " " * (len(offers) - ind - 1) + "]" +
              f" {math.ceil((ind + 1) * 100 / len(offers))}" + "%" +
              f" | Count of all parsed: {self.count_parsed_offers}."
              f" Progress ratio: {math.ceil(self.count_parsed_offers * 100 / total_planed_offers)} %.",
              end="\r", flush=True)

    def parse_list_offers_page(self, html, page_number: int, count_of_pages: int, attempt_number: int):
        list_soup = bs4.BeautifulSoup(html, 'html.parser')

        if list_soup.text.find("Captcha") > 0:
            print(f"\r{page_number} page: there is CAPTCHA... failed to parse page...")
            return False, attempt_number + 1, True

        offers = list_soup.select("div[data-mark='GKCard']")
        print("")
        print(f"\r {page_number} page: {len(offers)} offers", end="\r", flush=True)

        if page_number == self.start_page and attempt_number == 0:
            print(f"Collecting information from pages with list of offers", end="\n")

        for ind, offer in enumerate(offers):
            self.parse_offer(offer=offer)
            self.print_parse_progress(page_number=page_number, count_of_pages=count_of_pages, offers=offers, ind=ind)

        time.sleep(2)
        return True, 0, False

    def parse_offer(self, offer):
        common_data = dict()
        common_data["name"] = offer.select_one("span[data-mark='Text']").text
        common_data["location"] = self.location_name
        common_data["accommodation_type"] = self.accommodation_type
        common_data["url"] = "https://" + urllib.parse.urlparse(offer.select_one("a[data-mark='Link']").get('href')).netloc
        common_data["full_location_address"] = offer.select_one("div[data-mark='CellAddressBlock']").text

        if common_data["url"] in self.result_set:
            return

        flat_parser = NewObjectPageParser(session=self.session, url=common_data["url"])
        page_data = flat_parser.parse_page()
        time.sleep(4)
        self.count_parsed_offers += 1
        self.result_set.add(common_data["url"])
        self.result.append(union_dicts(common_data, page_data))
        if self.with_saving_csv:
            self.save_results()

    def save_results(self):
        keys = self.result[0].keys()
        with open(self.file_path, 'w', newline='', encoding='utf-8') as output_file:
            dict_writer = csv.DictWriter(output_file, keys, delimiter=';')
            dict_writer.writeheader()
            dict_writer.writerows(self.result)


================================================
FILE: cianparser/newobject/page.py
================================================
import bs4
import re
import time


class NewObjectPageParser:
    def __init__(self, session, url):
        self.session = session
        self.url = url

    def __load_page__(self):
        res = self.session.get(self.url)
        if res.status_code == 429:
            time.sleep(10)
        res.raise_for_status()
        self.offer_page_html = res.text
        self.offer_page_soup = bs4.BeautifulSoup(self.offer_page_html, 'html.parser')

    def parse_page(self):
        self.__load_page__()
        page_data = {
            "year_of_construction": -1,
            "house_material_type": -1,
            "finish_type": -1,
            "ceiling_height": -1,
            "class": -1,
            "parking_type": -1,
            "floors_from": -1,
            "floors_to": -1,
        }

        spans = self.offer_page_soup.select("span")
        for index, span in enumerate(spans):
            if "Срок сдачи" in span.text:
                page_data["year_of_construction"] = spans[index + 1].text
            if "Тип дома" == span.text:
                page_data["house_material_type"] = spans[index + 1].text
            if "Отделка" == span.text:
                page_data["finish_type"] = spans[index + 1].text
            if "Высота потолков" == span.text:
                page_data["ceiling_height"] = spans[index + 1].text
            if "Класс" == span.text:
                page_data["class"] = spans[index + 1].text
            if "Застройщик" in span.text and "Проектная декларация" in span.text:
                page_data["builder"] = span.text.split(".")[0]
            if "Парковка" == span.text:
                page_data["parking_type"] = spans[index + 1].text
            if "Этажность" == span.text:
                ints = re.findall(r'\d+', spans[index + 1].text)
                if len(ints) == 2:
                    page_data["floors_from"] = int(ints[0])
                    page_data["floors_to"] = int(ints[1])
                if len(ints) == 1:
                    page_data["floors_from"] = int(ints[0])
                    page_data["floors_to"] = int(ints[0])

        return page_data


================================================
FILE: cianparser/proxy_pool.py
================================================
import time
import urllib.request
import urllib.error
import bs4
import random
import socket


class ProxyPool:
    def __init__(self, proxies):
        self.__proxy_pool__ = [] if proxies is None else proxies
        self.__current_proxy__ = None
        self.__page_html__ = None

    def __is_captcha__(self):
        page_soup = bs4.BeautifulSoup(self.__page_html__, 'html.parser')
        return page_soup.text.find("Captcha") > 0

    def __is_available_proxy__(self, url, proxy):
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({'https': proxy}))
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib.request.install_opener(opener)
        try:
            self.__page_html__ = urllib.request.urlopen(urllib.request.Request(url))
        except Exception as detail:
            print(f"proxy error: {detail}..")
            return False
        return True

    def is_empty(self):
        return len(self.__proxy_pool__) == 0

    def get_available_proxy(self, url):
        print("The process of checking the proxies... Search an available one among them...")
        socket.setdefaulttimeout(5)

        found_proxy = False
        while len(self.__proxy_pool__) > 0 and found_proxy is False:
            proxy = random.choice(self.__proxy_pool__)
            is_available = self.__is_available_proxy__(url, proxy)
            is_captcha = self.__is_captcha__() if is_available else None

            if not is_available or is_captcha:
                if is_captcha:
                    print(f"proxy {proxy}: there is captcha.. trying another")
                else:
                    print(f"proxy {proxy}: unavailable.. trying another..")
                self.__proxy_pool__.remove(proxy)
                time.sleep(4)
                continue

            print(f"proxy {proxy}: available.. stop searching")
            self.__current_proxy__, found_proxy = proxy, True

        if self.__current_proxy__ is None:
            print("there are no available proxies..", end="\n\n")

        return self.__current_proxy__


================================================
FILE: cianparser/suburban/list.py
================================================
import bs4
import time
import pathlib
from datetime import datetime
from transliterate import translit

from cianparser.constants import FILE_NAME_SUBURBAN_FORMAT
from cianparser.helpers import union_dicts, define_author, parse_location_data, define_price_data, define_deal_url_id
from cianparser.suburban.page import SuburbanPageParser
from cianparser.base_list import BaseListPageParser


class SuburbanListPageParser(BaseListPageParser):
    def build_file_path(self):
        now_time = datetime.now().strftime("%d_%b_%Y_%H_%M_%S_%f")
        file_name = FILE_NAME_SUBURBAN_FORMAT.format(self.accommodation_type, self.object_type, self.deal_type,
                                                     self.start_page, self.end_page,
                                                     translit(self.location_name.lower(), reversed=True), now_time)
        return pathlib.Path(pathlib.Path.cwd(), file_name.replace("'", ""))

    def parse_list_offers_page(self, html, page_number: int, count_of_pages: int, attempt_number: int):
        list_soup = bs4.BeautifulSoup(html, 'html.parser')

        if list_soup.text.find("Captcha") > 0:
            print(f"\r{page_number} page: there is CAPTCHA... failed to parse page...")
            return False, attempt_number + 1, True

        header = list_soup.select("div[data-name='HeaderDefault']")
        if len(header) == 0:
            return False, attempt_number + 1, False

        offers = list_soup.select("article[data-name='CardComponent']")
        print("")
        print(f"\r {page_number} page: {len(offers)} offers", end="\r", flush=True)

        if page_number == self.start_page and attempt_number == 0:
            print(f"Collecting information from pages with list of offers", end="\n")

        for ind, offer in enumerate(offers):
            self.parse_offer(offer=offer)
            self.print_parse_progress(page_number=page_number, count_of_pages=count_of_pages, offers=offers, ind=ind)

        time.sleep(2)
        return True, 0, False

    def parse_offer(self, offer):
        common_data = dict()
        common_data["url"] = offer.select("div[data-name='LinkArea']")[0].select("a")[0].get('href')
        common_data["location"] = self.location_name
        common_data["deal_type"] = self.deal_type
        common_data["accommodation_type"] = self.accommodation_type
        common_data["suburban_type"] = self.object_type

        author_data = define_author(block=offer)
        location_data = parse_location_data(block=offer)
        price_data = define_price_data(block=offer)

        if define_deal_url_id(common_data["url"]) in self.result_set:
            return

        page_data = dict()
        if self.with_extra_data:
            suburban_parser = SuburbanPageParser(session=self.session, url=common_data["url"])
            page_data = suburban_parser.parse_page()
            time.sleep(4)

        self.count_parsed_offers += 1
        self.define_average_price(price_data=price_data)
        self.result_set.add(define_deal_url_id(common_data["url"]))
        self.result.append(union_dicts(author_data, common_data, price_data, page_data, location_data))
        if self.with_saving_csv:
            self.save_results()


================================================
FILE: cianparser/suburban/page.py
================================================
import time
import bs4


class SuburbanPageParser:
    def __init__(self, session, url):
        self.session = session
        self.url = url

    def __load_page__(self):
        res = self.session.get(self.url)
        if res.status_code == 429:
            time.sleep(10)
        res.raise_for_status()
        self.offer_page_html = res.text
        self.offer_page_soup = bs4.BeautifulSoup(self.offer_page_html, 'html.parser')

    def parse_page(self):
        self.__load_page__()
        page_data = {
            "year_of_construction": -1,
            "house_material_type": -1,
            "land_plot": -1,
            "land_plot_status": -1,
            "heating_type": -1,
            "gas_type": -1,
            "water_supply_type": -1,
            "sewage_system": -1,
            "bathroom": -1,
            "living_meters": -1,
            "floors_count": -1,
            "phone": "",
        }

        spans = self.offer_page_soup.select("span")
        for index, span in enumerate(spans):
            if "Материал дома" == span.text:
                page_data["house_material_type"] = spans[index + 1].text
            if "Участок" == span.text:
                page_data["land_plot"] = spans[index + 1].text
            if "Статус участка" == span.text:
                page_data["land_plot_status"] = spans[index + 1].text
            if "Отопление" == span.text:
                page_data["heating_type"] = spans[index + 1].text
            if "Газ" == span.text:
                page_data["gas_type"] = spans[index + 1].text
            if "Водоснабжение" == span.text:
                page_data["water_supply_type"] = spans[index + 1].text
            if "Канализация" == span.text:
                page_data["sewage_system"] = spans[index + 1].text
            if "Санузел" == span.text:
                page_data["bathroom"] = spans[index + 1].text
            if "Площадь кухни" == span.text:
                page_data["kitchen_meters"] = spans[index + 1].text
            if "Общая площадь" == span.text:
                page_data["living_meters"] = spans[index + 1].text
            if "Год постройки" in span.text:
                page_data["year_of_construction"] = spans[index + 1].text
            if "Год сдачи" in span.text:
                page_data["year_of_construction"] = spans[index + 1].text
            if "Этажей в доме" == span.text:
                page_data["floors_count"] = spans[index + 1].text

        if "+7" in self.offer_page_html:
            page_data["phone"] = self.offer_page_html[self.offer_page_html.find("+7"): self.offer_page_html.find("+7") + 16].split('"')[0].replace(" ", "").replace("-", "")

        return page_data


================================================
FILE: cianparser/url_builder.py
================================================
from cianparser.constants import *


class URLBuilder:
    def __init__(self, is_newobject):
        self.url = BASE_URL
        self.add_newobject_postfix() if is_newobject else self.add_default_postfix()
        self.url += DEFAULT_PATH

    def add_default_postfix(self):
        self.url += DEFAULT_POSTFIX_PATH

    def add_newobject_postfix(self):
        self.url += NEWOBJECT_POSTFIX_PATH

    def get_url(self):
        return self.url

    def add_accommodation_type(self, accommodation_type):
        self.url += OFFER_TYPE_PATH.format(accommodation_type)

    def add_deal_type(self, deal_type):
        self.url += DEAL_TYPE_PATH.format(deal_type)

    def add_location(self, location_id):
        self.url += REGION_PATH.format(location_id)

    def add_room(self, rooms):
        rooms_path = ""
        if type(rooms) is tuple:
            for count_of_room in rooms:
                if type(count_of_room) is int:
                    if 0 < count_of_room < 6:
                        rooms_path += ROOM_PATH.format(count_of_room)
                elif type(count_of_room) is str:
                    if count_of_room == "studio":
                        rooms_path += STUDIO_PATH
        elif type(rooms) is int:
            if 0 < rooms < 6:
                rooms_path += ROOM_PATH.format(rooms)
        elif type(rooms) is str:
            if rooms == "studio":
                rooms_path += STUDIO_PATH
            elif rooms == "all":
                rooms_path = ""
        self.url += rooms_path

    def add_rent_period_type(self, rent_period_type):
        self.url += RENT_PERIOD_TYPE_PATH.format(rent_period_type)

    def add_object_suburban_type(self, object_type):
        self.url += OBJECT_TYPE_PATH.format(OBJECT_SUBURBAN_TYPES[object_type])

    def add_additional_settings(self, additional_settings):
        if "object_type" in additional_settings.keys():
            self.url += OBJECT_TYPE_PATH.format(OBJECT_TYPES[additional_settings["object_type"]])
        if "is_by_homeowner" in additional_settings.keys() and additional_settings["is_by_homeowner"]:
            self.url += IS_ONLY_HOMEOWNER_PATH
        if "min_balconies" in additional_settings.keys():
            self.url += MIN_BALCONIES_PATH.format(additional_settings["min_balconies"])
        if "have_loggia" in additional_settings.keys() and additional_settings["have_loggia"]:
            self.url += HAVE_LOGGIA_PATH
        if "min_house_year" in additional_settings.keys():
            self.url += MIN_HOUSE_YEAR_PATH.format(additional_settings["min_house_year"])
        if "max_house_year" in additional_settings.keys():
            self.url += MAX_HOUSE_YEAR_PATH.format(additional_settings["max_house_year"])
        if "min_price" in additional_settings.keys():
            self.url += MIN_PRICE_PATH.format(additional_settings["min_price"])
        if "max_price" in additional_settings.keys():
            self.url += MAX_PRICE_PATH.format(additional_settings["max_price"])
        if "min_floor" in additional_settings.keys():
            self.url += MIN_FLOOR_PATH.format(additional_settings["min_floor"])
        if "max_floor" in additional_settings.keys():
            self.url += MAX_FLOOR_PATH.format(additional_settings["max_floor"])
        if "min_total_floor" in additional_settings.keys():
            self.url += MIN_TOTAL_FLOOR_PATH.format(additional_settings["min_total_floor"])
        if "max_total_floor" in additional_settings.keys():
            self.url += MAX_TOTAL_FLOOR_PATH.format(additional_settings["max_total_floor"])
        if "house_material_type" in additional_settings.keys():
            self.url += HOUSE_MATERIAL_TYPE_PATH.format(additional_settings["house_material_type"])
        if "metro" in additional_settings.keys() and "metro_station" in additional_settings.keys():
            if additional_settings["metro"] in METRO_STATIONS.keys():
                for metro_station, metro_id in METRO_STATIONS[additional_settings["metro"]]:
                    if additional_settings["metro_station"] == metro_station:
                        self.url += METRO_ID_PATH.format(metro_id)
        if "metro_foot_minute" in additional_settings.keys():
            self.url += METRO_FOOT_MINUTE_PATH.format(additional_settings["metro_foot_minute"])
        if "flat_share" in additional_settings.keys():
            self.url += FLAT_SHARE_PATH.format(additional_settings["flat_share"])
        if "only_flat" in additional_settings.keys() and additional_settings["only_flat"]:
            self.url += ONLY_FLAT_PATH.format(1)
        if "only_apartment" in additional_settings.keys() and additional_settings["only_apartment"]:
            self.url += APARTMENT_PATH.format(1)
        if "sort_by" in additional_settings.keys():
            if additional_settings["sort_by"] == IS_SORT_BY_PRICE_FROM_MIN_TO_MAX_PATH:
                self.url += SORT_BY_PRICE_FROM_MIN_TO_MAX_PATH
            if additional_settings["sort_by"] == IS_SORT_BY_PRICE_FROM_MAX_TO_MIN_PATH:
                self.url += SORT_BY_PRICE_FROM_MAX_TO_MIN_PATH
            if additional_settings["sort_by"] == IS_SORT_BY_TOTAL_METERS_FROM_MAX_TO_MIN_PATH:
                self.url += SORT_BY_TOTAL_METERS_FROM_MAX_TO_MIN_PATH
            if additional_settings["sort_by"] == IS_SORT_BY_CREATION_DATA_FROM_NEWER_TO_OLDER_PATH:
                self.url += SORT_BY_CREATION_DATA_FROM_NEWER_TO_OLDER_PATH
            if additional_settings["sort_by"] == IS_SORT_BY_CREATION_DATA_FROM_OLDER_TO_NEWER_PATH:
                self.url += SORT_BY_CREATION_DATA_FROM_OLDER_TO_NEWER_PATH


================================================
FILE: setup.cfg
================================================
[metadata]
name = cianparser
version = 1.0.4
description = Parser of information from the Cian website
url = https://github.com/lenarsaitov/cianparser
author = Lenar Saitov
author_email = lenarsaitov1@yandex.ru
long_description = file: README.md
license_file = MIT
keywords = python parser requests cloudscraper beautifulsoup cian realestate


================================================
FILE: setup.py
================================================
from setuptools import setup

with open("README.md", encoding="utf8") as file:
    read_me_description = file.read()

setup(
    name='cianparser',
    version='1.0.4',
    description='Parser of information from the Cian website',
    url='https://github.com/lenarsaitov/cianparser',
    author='Lenar Saitov',
    author_email='lenarsaitov1@yandex.ru',
    license='MIT',
    packages=['cianparser', 'cianparser.flat', 'cianparser.newobject', 'cianparser.suburban'],
    long_description=read_me_description,
    long_description_content_type="text/markdown",
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    keywords='python parser requests cloudscraper beautifulsoup cian realestate',
    # `datetime` is part of the Python standard library and must not be listed as a dependency
    install_requires=['cloudscraper', 'beautifulsoup4', 'transliterate', 'lxml'],
)
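
A quick sanity check of the pure helpers in `cianparser/helpers.py` — a minimal sketch with `union_dicts` and `define_deal_url_id` copied inline so the snippet runs standalone, without installing the package. The offer URL below is illustrative, not a real listing:

```python
import itertools


def union_dicts(*dicts):
    # Mirrors cianparser.helpers.union_dicts: merge dicts left to right,
    # keys from later dicts override earlier ones.
    return dict(itertools.chain.from_iterable(dct.items() for dct in dicts))


def define_deal_url_id(url: str):
    # Mirrors cianparser.helpers.define_deal_url_id: the offer id is the last
    # path segment longer than 3 characters (tolerates a trailing slash).
    url_path_elements = url.split("/")
    if len(url_path_elements[-1]) > 3:
        return url_path_elements[-1]
    if len(url_path_elements[-2]) > 3:
        return url_path_elements[-2]
    return "-1"


offer_url = "https://cian.ru/sale/flat/287000000/"   # illustrative url
offer_id = define_deal_url_id(offer_url)
merged = union_dicts({"url": offer_url}, {"price": 45000000})
print(offer_id)          # the id segment, with or without the trailing slash
print(merged["price"])
```

The duplicate-detection logic in the list parsers relies on exactly this id extraction: two URLs that differ only by a trailing slash map to the same entry in `result_set`.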