Repository: TLYu0419/facebook_crawler
Branch: main
Commit: 82ec3ec46aa0
Files: 18
Total size: 131.1 KB

Directory structure:
gitextract_m2qqy_6g/
├── .github/
│   └── workflows/
│       └── pythonpublish.yml
├── .gitignore
├── LICENSE
├── README.md
├── main.py
├── paser.py
├── requester.py
├── requirements.txt
├── sample/
│   ├── 20221013_Sample.ipynb
│   ├── FansPages.ipynb
│   └── Group.ipynb
├── setup.py
├── tests/
│   ├── test_facebook_crawler.py
│   ├── test_page_parser.py
│   ├── test_post_parser.py
│   ├── test_requester.py
│   └── test_utils.py
└── utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/pythonpublish.yml
================================================
name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: Set up Python
        uses: actions/setup-python@v1
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install setuptools wheel twine
      - name: Build and publish
        env:
          TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
          TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
        run: |
          python setup.py sdist bdist_wheel
          twine upload dist/*

================================================
FILE: .gitignore
================================================
develop/
.ipynb_checkpoints/
.vscode/
*egg-info/
.idea/
venv2/
build/
dist/
__pycache__/
data/

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2021 tlyu0419

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to 
permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
# Facebook_Crawler

[](https://pepy.tech/project/facebook-crawler) [](https://pepy.tech/project/facebook-crawler) [](https://pepy.tech/project/facebook-crawler)

## What's this?

This Python package helps people collect and analyze public Fanspage or Group data from Facebook with ease and efficiency. Here are the three big points of this project:

1. Private: You don't need to log in to your account.
2. Easy: Just key in the link of the Fanspage or Group and the target date, and it will work.
3. Efficient: It collects the data directly through the requests package instead of opening another browser.

這個 Python 套件旨在幫助使用者輕鬆且快速的收集 Facebook 公開粉絲頁和公開社團的資料,藉以進行後續的分析。以下是本專案的 3 個重點:

1. 隱私: 不需要登入你個人的帳號密碼
2. 簡單: 僅需輸入粉絲頁/社團的網址和停止的日期就可以開始執行程式
3. 高效: 透過 requests 直接向伺服器請求資料,不需另外開啟一個新的瀏覽器

## Quickstart

### Install

```pip
pip install -U facebook-crawler
```

### Usage

- Facebook Fanspage

```python
import facebook_crawler
pageurl = 'https://www.facebook.com/diudiu333'
facebook_crawler.Crawl_PagePosts(pageurl=pageurl, until_date='2021-01-01')
```

- Group

```python
import facebook_crawler
groupurl = 'https://www.facebook.com/groups/pythontw'
facebook_crawler.Crawl_GroupPosts(groupurl, until_date='2021-01-01')
```

## FAQ

- **How to get the comments or replies to the posts?**
  > Please write an email to me and tell me your project goal. Thanks!
- **How can I find a post's link from the data?**
  > Add the string 'https://www.facebook.com/' in front of the POSTID, and that is the post's link. For example, if the POSTID is 123456789, its link is 'https://www.facebook.com/123456789'.
- **Can I collect data from a specific time period directly?**
  > No! This is the same as the behavior when we use Facebook: the data has to be collected from the newest posts back to the older ones.

## License

[MIT License](https://github.com/TLYu0419/facebook_crawler/blob/main/LICENSE)

## Contribution

[](https://payment.ecpay.com.tw/QuickCollect/PayData?GcM4iJGUeCvhY%2fdFqqQ%2bFAyf3uA10KRo%2fqzP4DWtVcw%3d)

A donation is not required to use this package, but it would be great to have your support. Donating, starring, or forking are all good ways to support me in maintaining and developing this project. Thanks to these donors' kind help, this project can keep being maintained and developed.

**贊助不是使用這個套件的必要條件**,但如能獲得你的支持我將會非常感謝。不論是贊助、給予星星或分享都是很好的支持方式,幫助我繼續維護和開發這個專案。由於這些捐助者慷慨的幫助,這個項目才得以持續維護和發展。

- Universities
  - [Department of Social Work. The Chinese University of Hong Kong(香港中文大學社會工作學系)](https://web.swk.cuhk.edu.hk/zh-tw/)
  - [Education, Graduate School of Curriculum and Instructional Communications Technology. National Taipei University of Education.(國立台北教育大學課程與教學傳播科技研究所)](https://cict.ntue.edu.tw/?locale=zh_tw)
  - [Department of Business Administration. Chung Hua University.(中華大學企業管理學系)](https://ba.chu.edu.tw/?Lang=en)

## Contact Info

- Author: TENG-LIN YU
- Email: tlyu0419@gmail.com
- Facebook: https://www.facebook.com/tlyu0419
- PYPI: https://pypi.org/project/facebook-crawler/
- Github: https://github.com/TLYu0419/facebook_crawler

## Log

- 0.0.28: Modularized the crawler function.
- 0.0.26: Automatically changes the cookie after it expires to keep crawling data without changing IP.

================================================
FILE: main.py
================================================
from paser import _parse_category, _parse_pagename, _parse_creation_time, _parse_pagetype, _parse_likes, _parse_docid, _parse_pageurl
from paser import _parse_entryPoint, _parse_identifier, _parse_docid, _parse_composite_nojs, _parse_composite_graphql, _parse_relatedpages, _parse_pageinfo
from requester import _get_homepage, _get_posts, _get_headers
from utils import _init_request_vars
from bs4 import BeautifulSoup
import os
import re
import json
import time
from tqdm import tqdm
import pandas as pd
import pickle
import datetime
import warnings
warnings.filterwarnings("ignore")


def Crawl_PagePosts(pageurl, until_date='2018-01-01', cursor=''):
    # initial request variables
    df, cursor, max_date, break_times = _init_request_vars(cursor)

    # get headers
    headers = _get_headers(pageurl)

    # Get pageid, postid and entryPoint from homepage_response
    homepage_response = _get_homepage(pageurl, headers)
    entryPoint = _parse_entryPoint(homepage_response)
    identifier = _parse_identifier(entryPoint, homepage_response)
    docid = _parse_docid(entryPoint, homepage_response)

    # Keep crawling posts until reaching the until_date
    while max_date >= until_date:
        try:
            # Get posts by identifier, docid and entryPoint
            resp = _get_posts(headers, identifier, entryPoint, docid, cursor)
            if entryPoint == 'nojs':
                ndf, max_date, cursor = _parse_composite_nojs(resp)
                df.append(ndf)
            else:
                ndf, max_date, cursor = _parse_composite_graphql(resp)
                df.append(ndf)
            # Test
            # print(ndf.shape[0])
            break_times = 0
        except:
            # print(resp.json()[:3000])
            try:
                if resp.json()['data']['node']['timeline_feed_units']['page_info']['has_next_page'] == False:
                    print('The posts of the page have run out!')
                    break
            except:
                pass
            print('Break Times {}: Something went wrong with this request. Sleep 20 seconds and send the request again.'.format(break_times))
            print('REQUEST LOG >> pageid: {}, docid: {}, cursor: {}'.format(identifier, docid, cursor))
            print('RESPONSE LOG: ', resp.text[:3000])
            print('================================================')
            break_times += 1
            if break_times > 15:
                print('Please check whether your target page/group is up to date.')
                print('If so, you can ignore this break-time message; if not, please change your Internet IP and run this crawler again.')
                break
            time.sleep(20)

            # Get new headers
            headers = _get_headers(pageurl)

    # Concat all dataframes
    df = pd.concat(df, ignore_index=True)
    df['UPDATETIME'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return df


def Crawl_GroupPosts(pageurl, until_date='2022-01-01'):
    df = Crawl_PagePosts(pageurl, until_date)
    return df


def Crawl_RelatedPages(seedpages, rounds):
    # init
    df = pd.DataFrame(data=[], columns=['SOURCE', 'TARGET', 'ROUND'])
    pageurls = list(set(seedpages))
    crawled_list = list(set(df['SOURCE']))
    headers = _get_headers(pageurls[0])
    for i in range(rounds):
        print('Round {} started at: {}!'.format(i, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
        for pageurl in tqdm(pageurls):
            if pageurl not in crawled_list:
                try:
                    homepage_response = _get_homepage(pageurl=pageurl, headers=headers)
                    if 'Sorry, something went wrong.' not in homepage_response.text:
                        entryPoint = _parse_entryPoint(homepage_response)
                        identifier = _parse_identifier(entryPoint, homepage_response)
                        relatedpages = _parse_relatedpages(homepage_response, entryPoint, identifier)
                        ndf = pd.DataFrame({'SOURCE': homepage_response.url,
                                            'TARGET': relatedpages,
                                            'ROUND': i})
                        df = pd.concat([df, ndf], ignore_index=True)
                except:
                    pass
                    # print('ERROR: {}'.format(pageurl))
        pageurls = list(set(df['TARGET']))
        crawled_list = list(set(df['SOURCE']))
    return df


def Crawl_PageInfo(pagenum, pageurl):
    break_times = 0
    global headers
    while True:
        try:
            homepage_response = _get_homepage(pageurl, headers)
            pageinfo = _parse_pageinfo(homepage_response)
            with open('data/pageinfo/' + str(pagenum) + '.pickle', "wb") as fp:
                pickle.dump(pageinfo, fp)
            break
        except:
            break_times = break_times + 1
            if break_times >= 5:
                break
            time.sleep(5)
            headers = _get_headers(pageurl=pageurl)


if __name__ == '__main__':
    os.makedirs('data/', exist_ok=True)
    # ===== fb_api_req_friendly_name: ProfileCometTimelineFeedRefetchQuery ====
    pageurl = 'https://www.facebook.com/Gooaye'  # 股癌 Gooaye: 304K followers
    pageurl = 'https://www.facebook.com/StockOldBull'  # 股海老牛: 160K
    pageurl = 'https://www.facebook.com/twherohan'
    pageurl = 'https://www.facebook.com/diudiu333'
    pageurl = 'https://www.facebook.com/chengwentsan'
    pageurl = 'https://www.facebook.com/MaYingjeou'
    pageurl = 'https://www.facebook.com/roberttawikofficial'
    pageurl = 'https://www.facebook.com/NizamAbTitingan'
    pageurl = 'https://www.facebook.com/joebiden'

    # ==== fb_api_req_friendly_name: CometModernPageFeedPaginationQuery ====
    pageurl = 'https://www.facebook.com/ebcmoney/'  # 東森財經: 810K followers
    pageurl = 'https://www.facebook.com/moneyweekly.tw/'  # 理財周刊: 363K
    pageurl = 'https://www.facebook.com/cmoneyapp/'  # CMoney 理財寶: 842K
    pageurl = 'https://www.facebook.com/emily0806'  # 艾蜜莉-自由之路: 209K followers
    pageurl = 'https://www.facebook.com/imoney889/'  # 林恩如-飆股女王: 102K
    pageurl = 'https://www.facebook.com/wealth1974/'  # 財訊: 175K
    pageurl = 'https://www.facebook.com/smart16888/'  # 郭莉芳理財講堂: 16K
    pageurl = 'https://www.facebook.com/smartmonthly/'  # Smart 智富月刊: 526K
    pageurl = 'https://www.facebook.com/ezmoney.tw/'  # 統一投信: 15K
    pageurl = 'https://www.facebook.com/MoneyMoneyMeg/'  # Money錢: 207K
    pageurl = 'https://www.facebook.com/imoneymagazine/'  # iMoney 智富雜誌: 380K
    pageurl = 'https://www.facebook.com/edigest/'  # 經濟一週 EDigest: 362K
    pageurl = 'https://www.facebook.com/BToday/'  # 今周刊: 1.07M
    pageurl = 'https://www.facebook.com/GreenHornFans/'  # 綠角財經筆記: 250K
    pageurl = 'https://www.facebook.com/ec.ltn.tw/'  # 自由時報財經頻道: 42,656 followers
    pageurl = 'https://www.facebook.com/MoneyDJ'  # MoneyDJ理財資訊: 141,302 followers
    pageurl = 'https://www.facebook.com/YahooTWFinance/'  # Yahoo奇摩股市理財: 149,624 followers
    pageurl = 'https://www.facebook.com/win3105'
    pageurl = 'https://www.facebook.com/Diss%E7%BA%8F%E7%B6%BF-111182238148502/'

    # fb_api_req_friendly_name: CometUFICommentsProviderQuery
    pageurl = 'https://www.facebook.com/anuetw/'  # Anue鉅亨網財經新聞: 312K followers
    pageurl = 'https://www.facebook.com/wealtholic/'  # 投資癮 Wealtholic: 2.萬

    # fb_api_req_friendly_name: PresenceStatusProviderSubscription_ContactProfilesQuery
    # fb_api_req_friendly_name: GroupsCometFeedRegularStoriesPaginationQuery
    pageurl = 'https://www.facebook.com/groups/pythontw'
    pageurl = 'https://www.facebook.com/groups/corollacrossclub/'

    df = Crawl_PagePosts(pageurl, until_date='2022-08-10')

    # df = Crawl_RelatedPages(seedpages=pageurls, rounds=10)
    df = pd.read_csv('./data/relatedpages_edgetable.csv')[['SOURCE', 'TARGET', 'ROUND']]
    headers = _get_headers(pageurl=pageurl)
    for pagenum in tqdm(df['index']):
        try:
            Crawl_PageInfo(pagenum=pagenum, pageurl=df['pageurl'][pagenum])
        except:
            pass
    homepage_response = _get_homepage(pageurl, headers)
    pageinfo = _parse_pageinfo(homepage_response)

    # import pandas as pd
    from main import Crawl_PagePosts
    pageurl = 'https://www.facebook.com/hatendhu'
    df = Crawl_PagePosts(pageurl, until_date='2014-11-01')
    df
    df.to_pickle('./data/20220926_hatendhu.pkl')
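A note on the stop condition in `Crawl_PagePosts`: `max_date` and `until_date` are compared as plain strings. That is safe only because the parsers format `TIME` as `%Y-%m-%d %H:%M:%S`, and ISO-ordered timestamps sort lexicographically in the same order as chronologically. A minimal, self-contained sketch of this invariant (the helper name `is_older_than_cutoff` is illustrative, not part of the package):

```python
import datetime

def is_older_than_cutoff(max_date: str, until_date: str) -> bool:
    # Mirrors the `while max_date >= until_date` loop in Crawl_PagePosts:
    # "%Y-%m-%d %H:%M:%S" strings compare lexicographically in
    # chronological order, so no datetime parsing is needed.
    return max_date < until_date

ts = datetime.datetime(2021, 12, 31, 23, 59, 59).strftime("%Y-%m-%d %H:%M:%S")
print(is_older_than_cutoff(ts, '2022-01-01'))                     # True: crawl stops
print(is_older_than_cutoff('2022-01-01 00:00:00', '2022-01-01'))  # False: keep crawling
```

Note that a bare date cutoff like `'2022-01-01'` compares as a prefix, so any timestamp on that day still counts as "in range".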
================================================ FILE: paser.py ================================================ import re import json import pandas as pd from bs4 import BeautifulSoup from utils import _extract_id, _init_request_vars import datetime from requester import _get_pageabout, _get_pagetransparency, _get_homepage, _get_posts, _get_headers import requests # Post-Paser def _parse_edgelist(resp): ''' Take edges from the response by graphql api ''' edges = [] try: edges = resp.json()['data']['node']['timeline_feed_units']['edges'] except: for data in resp.text.split('\r\n', -1): try: edges.append(json.loads(data)[ 'data']['node']['timeline_list_feed_units']['edges'][0]) except: edges.append(json.loads(data)['data']) return edges def _parse_edge(edge): ''' Parse edge to take informations, such as post name, id, message..., etc. ''' comet_sections = edge['node']['comet_sections'] # name name = comet_sections['context_layout']['story']['comet_sections']['actor_photo']['story']['actors'][0]['name'] # creation_time creation_time = comet_sections['context_layout']['story']['comet_sections']['metadata'][0]['story']['creation_time'] # message try: message = comet_sections['content']['story']['comet_sections']['message']['story']['message']['text'] except: try: message = comet_sections['content']['story']['comet_sections']['message_container']['story']['message']['text'] except: message = comet_sections['content']['story']['comet_sections']['message_container'] # postid postid = comet_sections['feedback']['story']['feedback_context'][ 'feedback_target_with_context']['ufi_renderer']['feedback']['subscription_target_id'] # actorid pageid = comet_sections['context_layout']['story']['comet_sections']['actor_photo']['story']['actors'][0]['id'] # comment_count comment_count = comet_sections['feedback']['story']['feedback_context'][ 'feedback_target_with_context']['ufi_renderer']['feedback']['comment_count']['total_count'] # reaction_count reaction_count = 
comet_sections['feedback']['story']['feedback_context']['feedback_target_with_context'][ 'ufi_renderer']['feedback']['comet_ufi_summary_and_actions_renderer']['feedback']['reaction_count']['count'] # share_count share_count = comet_sections['feedback']['story']['feedback_context']['feedback_target_with_context'][ 'ufi_renderer']['feedback']['comet_ufi_summary_and_actions_renderer']['feedback']['share_count']['count'] # toplevel_comment_count toplevel_comment_count = comet_sections['feedback']['story']['feedback_context'][ 'feedback_target_with_context']['ufi_renderer']['feedback']['toplevel_comment_count']['count'] # top_reactions top_reactions = comet_sections['feedback']['story']['feedback_context']['feedback_target_with_context']['ufi_renderer'][ 'feedback']['comet_ufi_summary_and_actions_renderer']['feedback']['cannot_see_top_custom_reactions']['top_reactions']['edges'] # comet_footer_renderer for link try: comet_footer_renderer = comet_sections['content']['story']['attachments'][0]['comet_footer_renderer'] # attachment_title attachment_title = comet_footer_renderer['attachment']['title_with_entities']['text'] # attachment_description attachment_description = comet_footer_renderer['attachment']['description']['text'] except: attachment_title = '' attachment_description = '' # all_subattachments for photos try: try: media = comet_sections['content']['story']['attachments'][0]['styles']['attachment']['all_subattachments']['nodes'] attachments_photos = ', '.join( [image['media']['viewer_image']['uri'] for image in media]) except: media = comet_sections['content']['story']['attachments'][0]['styles']['attachment'] attachments_photos = media['media']['photo_image']['uri'] except: attachments_photos = '' # cursor cursor = edge['cursor'] # actor url actor_url = comet_sections['context_layout']['story']['comet_sections']['actor_photo']['story']['actors'][0]['url'] # post url post_url = comet_sections['content']['story']['wwwURL'] return [name, pageid, postid, 
creation_time, message, reaction_count, comment_count, toplevel_comment_count, share_count, top_reactions, attachment_title, attachment_description, attachments_photos, cursor, actor_url, post_url] def _parse_domops(resp): ''' Take name, data id, time , message and page link from domops ''' data = re.sub(r'for \(;;\);', '', resp.text) data = json.loads(data) domops = data['domops'][0][3]['__html'] cursor = re.findall( 'timeline_cursor%22%3A%22(.*?)%22%2C%22timeline_section_cursor', domops)[0] content_list = [] soup = BeautifulSoup(domops, 'lxml') for content in soup.findAll('div', {'class': 'userContentWrapper'}): # name name = content.find('img')['aria-label'] # id dataid = content.find('div', {'data-testid': 'story-subtitle'})['id'] # actorid pageid = _extract_id(dataid, 0) # postid postid = _extract_id(dataid, 1) # time time = content.find('abbr')['data-utime'] # message message = content.find('div', {'data-testid': 'post_message'}) if message == None: message = '' else: if len(message.findAll('p')) >= 1: message = ''.join(p.text for p in message.findAll('p')) elif len(message.select('span > span')) >= 2: message = message.find('span').text # attachment_title try: attachment_title = content.find( 'a', {'data-lynx-mode': 'hover'})['aria-label'] except: attachment_title = '' # attachment_description try: attachment_description = content.find( 'a', {'data-lynx-mode': 'hover'}).text except: attachment_description = '' # actor_url actor_url = content.find('a')['href'].split('?')[0] # post_url post_url = 'https://www.facebook.com/' + postid content_list.append([name, pageid, postid, time, message, attachment_title, attachment_description, cursor, actor_url, post_url]) return content_list, cursor def _parse_jsmods(resp): ''' Take postid, pageid, comment count , reaction count, sharecount, reactions and display_comments_count from jsmods ''' data = re.sub(r'for \(;;\);', '', resp.text) data = json.loads(data) jsmods = data['jsmods'] requires_list = [] for requires in 
jsmods['pre_display_requires']: try: feedback = requires[3][1]['__bbox']['result']['data']['feedback'] # subscription_target_id ==> postid subscription_target_id = feedback['subscription_target_id'] # owning_profile_id ==> pageid owning_profile_id = feedback['owning_profile']['id'] # comment_count comment_count = feedback['comment_count']['total_count'] # reaction_count reaction_count = feedback['reaction_count']['count'] # share_count share_count = feedback['share_count']['count'] # top_reactions top_reactions = feedback['top_reactions']['edges'] # display_comments_count display_comments_count = feedback['display_comments_count']['count'] # append data to list requires_list.append([subscription_target_id, owning_profile_id, comment_count, reaction_count, share_count, top_reactions, display_comments_count]) except: pass # reactions--video posts for requires in jsmods['require']: try: # entidentifier ==> postid entidentifier = requires[3][2]['feedbacktarget']['entidentifier'] # pageid actorid = requires[3][2]['feedbacktarget']['actorid'] # comment count commentcount = requires[3][2]['feedbacktarget']['commentcount'] # reaction count likecount = requires[3][2]['feedbacktarget']['likecount'] # sharecount sharecount = requires[3][2]['feedbacktarget']['sharecount'] # reactions reactions = [] # display_comments_count commentcount = requires[3][2]['feedbacktarget']['commentcount'] # append data to list requires_list.append( [entidentifier, actorid, commentcount, likecount, sharecount, reactions, commentcount]) except: pass return requires_list def _parse_composite_graphql(resp): edges = _parse_edgelist(resp) df = [] for edge in edges: try: ndf = _parse_edge(edge) df.append(ndf) except: pass df = pd.DataFrame(df, columns=['NAME', 'PAGEID', 'POSTID', 'TIME', 'MESSAGE', 'REACTIONCOUNT', 'COMMENTCOUNT', 'DISPLAYCOMMENTCOUNT', 'SHARECOUNT', 'REACTIONS', 'ATTACHMENT_TITLE', 'ATTACHMENT_DESCRIPTION', 'ATTACHMENT_PHOTOS', 'CURSOR', 'ACTOR_URL', 'POST_URL']) df = df[['NAME', 
'PAGEID', 'POSTID', 'TIME', 'MESSAGE', 'ATTACHMENT_TITLE', 'ATTACHMENT_DESCRIPTION', 'ATTACHMENT_PHOTOS', 'REACTIONCOUNT', 'COMMENTCOUNT', 'DISPLAYCOMMENTCOUNT', 'SHARECOUNT', 'REACTIONS', 'CURSOR', 'ACTOR_URL', 'POST_URL']] cursor = df['CURSOR'].to_list()[-1] df['TIME'] = df['TIME'].apply(lambda x: datetime.datetime.fromtimestamp( int(x)).strftime("%Y-%m-%d %H:%M:%S")) max_date = df['TIME'].max() print('The maximum date of these posts is: {}, keep crawling...'.format(max_date)) return df, max_date, cursor def _parse_composite_nojs(resp): domops, cursor = _parse_domops(resp) domops = pd.DataFrame(domops, columns=['NAME', 'PAGEID', 'POSTID', 'TIME', 'MESSAGE', 'ATTACHMENT_TITLE', 'ATTACHMENT_DESCRIPTION', 'CURSOR', 'ACTOR_URL', 'POST_URL']) domops['TIME'] = domops['TIME'].apply( lambda x: datetime.datetime.fromtimestamp(int(x)).strftime("%Y-%m-%d %H:%M:%S")) jsmods = _parse_jsmods(resp) jsmods = pd.DataFrame(jsmods, columns=[ 'POSTID', 'PAGEID', 'COMMENTCOUNT', 'REACTIONCOUNT', 'SHARECOUNT', 'REACTIONS', 'DISPLAYCOMMENTCOUNT']) df = pd.merge(left=domops, right=jsmods, how='inner', on=['PAGEID', 'POSTID']) df = df[['NAME', 'PAGEID', 'POSTID', 'TIME', 'MESSAGE', 'ATTACHMENT_TITLE', 'ATTACHMENT_DESCRIPTION', 'REACTIONCOUNT', 'COMMENTCOUNT', 'DISPLAYCOMMENTCOUNT', 'SHARECOUNT', 'REACTIONS', 'CURSOR', 'ACTOR_URL', 'POST_URL']] max_date = df['TIME'].max() print('The maximum date of these posts is: {}, keep crawling...'.format(max_date)) return df, max_date, cursor # Page paser def _parse_pagetype(homepage_response): if '/groups/' in homepage_response.url: pagetype = 'Group' else: pagetype = 'Fanspage' return pagetype def _parse_pagename(homepage_response): raw_json = homepage_response.text.encode('utf-8').decode('unicode_escape') # pattern1 if len(re.findall(r'{"page":{"name":"(.*?)",', raw_json)) >= 1: pagename = re.findall(r'{"page":{"name":"(.*?)",', raw_json)[0] pagename = re.sub(r'\s\|\sFacebook', '', pagename) return pagename # pattern2 if 
len(re.findall('","name":"(.*?)","', raw_json)) >= 1: pagename = re.findall('","name":"(.*?)","', raw_json)[0] pagename = re.sub(r'\s\|\sFacebook', '', pagename) return pagename def _parse_entryPoint(homepage_response): try: entryPoint = re.findall( '"entryPoint":{"__dr":"(.*?)"}}', homepage_response.text)[0] except: entryPoint = 'nojs' return entryPoint def _parse_identifier(entryPoint, homepage_response): if entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint', 'CometGroupDiscussionRoot.entrypoint']: # pattern 1 if len(re.findall('"identifier":"{0,1}([0-9]{5,})"{0,1},', homepage_response.text)) >= 1: identifier = re.findall( '"identifier":"{0,1}([0-9]{5,})"{0,1},', homepage_response.text)[0] # pattern 2 elif len(re.findall('fb://profile/(.*?)"', homepage_response.text)) >= 1: identifier = re.findall( 'fb://profile/(.*?)"', homepage_response.text)[0] # pattern 3 elif len(re.findall('content="fb://group/([0-9]{1,})" />', homepage_response.text)) >= 1: identifier = re.findall( 'content="fb://group/([0-9]{1,})" />', homepage_response.text)[0] elif entryPoint in ['CometSinglePageHomeRoot.entrypoint', 'nojs']: # pattern 1 if len(re.findall('"pageID":"{0,1}([0-9]{5,})"{0,1},', homepage_response.text)) >= 1: identifier = re.findall( '"pageID":"{0,1}([0-9]{5,})"{0,1},', homepage_response.text)[0] return identifier def _parse_docid(entryPoint, homepage_response): soup = BeautifulSoup(homepage_response.text, 'lxml') if entryPoint == 'nojs': docid = 'NoDocid' else: for link in soup.findAll('link', {'rel': 'preload'}): resp = requests.get(link['href']) for line in resp.text.split('\n', -1): if 'ProfileCometTimelineFeedRefetchQuery_' in line: docid = re.findall('e.exports="([0-9]{1,})"', line)[0] break if 'CometModernPageFeedPaginationQuery_' in line: docid = re.findall('e.exports="([0-9]{1,})"', line)[0] break if 'CometUFICommentsProviderQuery_' in line: docid = re.findall('e.exports="([0-9]{1,})"', line)[0] break if 'GroupsCometFeedRegularStoriesPaginationQuery' in 
line: docid = re.findall('e.exports="([0-9]{1,})"', line)[0] break if 'docid' in locals(): break return docid def _parse_likes(homepage_response, entryPoint, headers): if entryPoint in ['CometGroupDiscussionRoot.entrypoint']: pageabout = _get_pageabout(homepage_response, entryPoint, headers) members = re.findall( ',"group_total_members_info_text":"(.*?) total members","', pageabout.text)[0] members = re.sub(',', '', members) return members else: # pattern 1 data = re.findall( '"page_likers":{"global_likers_count":([0-9]{1,})},"', homepage_response.text) if len(data) >= 1: likes = data[0] return likes # pattern 2 data = re.findall( ' ([0-9]{0,},{0,}[0-9]{0,},{0,}[0-9]{0,},{0,}[0-9]{0,},{0,}[0-9]{0,},{0,}) likes', homepage_response.text) if len(data) >= 1: likes = data[0] likes = re.sub(',', '', likes) return likes def _parse_creation_time(homepage_response, entryPoint, headers): try: if entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint']: transparency_response = _get_pagetransparency( homepage_response, entryPoint, headers) transparency_info = re.findall( '"field_section_type":"transparency","profile_fields":{"nodes":\[{"title":(.*?}),"field_type":"creation_date",', transparency_response.text)[0] creation_time = json.loads(transparency_info)['text'] elif entryPoint in ['CometSinglePageHomeRoot.entrypoint']: creation_time = re.findall( ',"page_creation_date":{"text":"Page created - (.*?)"},', homepage_response.text)[0] elif entryPoint in ['nojs']: if len(re.findall('Page created - (.*?)', homepage_response.text)) >= 1: creation_time = re.findall( 'Page created - (.*?)', homepage_response.text)[0] else: creation_time = re.findall( ',"foundingDate":"(.*?)"}', homepage_response.text)[0][:10] elif entryPoint in ['CometGroupDiscussionRoot.entrypoint']: pageabout = _get_pageabout(homepage_response, entryPoint, headers) creation_time = re.findall( '"group_history_summary":{"text":"Group created on (.*?)"}},', pageabout.text)[0] try: creation_time = 
datetime.datetime.strptime( creation_time, '%B %d, %Y') except: creation_time = creation_time + ', ' + datetime.datetime.now().year creation_time = datetime.datetime.strptime( creation_time, '%B %d, %Y') creation_time = creation_time.strftime('%Y-%m-%d') except: creation_time = 'NotAvailable' return creation_time def _parse_category(homepage_response, entryPoint, headers): pageabout = _get_pageabout(homepage_response, entryPoint, headers) if entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint']: if 'Page \\u00b7 Politician' in pageabout.text: category = 'Politician' if len(re.findall(r'"text":"Page \\u00b7 (.*?)"}', homepage_response.text)) >= 1: category = re.findall( r'"text":"Page \\u00b7 (.*?)"}', homepage_response.text)[0] else: soup = BeautifulSoup(pageabout.text) for script in soup.findAll('script', {'type': 'application/ld+json'}): if 'BreadcrumbList' in script.text: data = script.text.encode('utf-8').decode('unicode_escape') category = json.loads(data)['itemListElement'] category = ' / '.join([cate['name'] for cate in category]) elif entryPoint in ['CometSinglePageHomeRoot.entrypoint', 'nojs']: if len(re.findall('","category_name":"(.*?)","', homepage_response.text)) >= 1: category = re.findall( '","category_name":"(.*?)","', homepage_response.text) category = ' / '.join([cate for cate in category]) else: soup = BeautifulSoup(homepage_response.text) if len(soup.findAll('span', {'itemprop': 'itemListElement'})) >= 1: category = [span.text for span in soup.findAll( 'span', {'itemprop': 'itemListElement'})] category = ' / '.join(category) else: for script in soup.findAll('script', {'type': 'application/ld+json'}): if 'BreadcrumbList' in script.text: data = script.text.encode( 'utf-8').decode('unicode_escape') category = json.loads(data)['itemListElement'] category = ' / '.join([cate['name'] for cate in category]) elif entryPoint in ['PagesCometAdminSelfViewAboutContainerRoot.entrypoint']: category = eval(re.findall( 
'"page_categories":(.*?),"addressEditable', homepage_response.text)[0]) category = ' / '.join([cate['text'] for cate in category]) elif entryPoint in ['CometGroupDiscussionRoot.entrypoint']: category = 'Group' try: category = re.sub(r'\\/', '/', category) except: category = '' return category def _parse_pageurl(homepage_response): pageurl = homepage_response.url pageurl = re.sub('/$', '', pageurl) return pageurl def _parse_relatedpages(homepage_response, entryPoint, identifier): relatedpages = [] if entryPoint in ['CometSinglePageHomeRoot.entrypoint']: try: data = re.findall( r'"related_pages":\[(.*?)\],"view_signature"', homepage_response.text)[0] data = re.sub('},{', '},,,,{', data) for pages in data.split(',,,,', -1): # print('id:', json.loads(pages)['id']) # print('category_name:', json.loads(pages)['category_name']) # print('name:', json.loads(pages)['name']) url = json.loads(pages)['url'] url = url.split('?', -1)[0] url = re.sub(r'/$', '', url) # print('url:', url) # print('========') relatedpages.append(url) except: pass elif entryPoint in ['nojs']: soup = BeautifulSoup(homepage_response.text, 'lxml') soup = soup.find( 'div', {'id': 'PageRelatedPagesSecondaryPagelet_{}'.format(identifier)}) for page in soup.select('ul > li > div'): # print('name: ', page.find('img')['aria-label']) url = page.find('a')['href'] url = url.split('?', -1)[0] url = re.sub(r'/$', '', url) # print('url:', url) # print('===========') relatedpages.append(url) elif entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint', 'CometGroupDiscussionRoot.entrypoint']: pass # print('There\'s no related pages recommend.') return relatedpages def _parse_pageinfo(homepage_response): ''' Parse the homepage response to get the page information, including id, docid and api_name. 
    '''
    # pagetype
    pagetype = _parse_pagetype(homepage_response)
    # pagename
    pagename = _parse_pagename(homepage_response)
    # entryPoint
    entryPoint = _parse_entryPoint(homepage_response)
    # identifier
    identifier = _parse_identifier(entryPoint, homepage_response)
    # docid
    docid = _parse_docid(entryPoint, homepage_response)
    # likes / members
    likes = _parse_likes(homepage_response, entryPoint, headers)
    # creation time
    creation_time = _parse_creation_time(
        homepage_response, entryPoint, headers)
    # category
    category = _parse_category(homepage_response, entryPoint, headers)
    # pageurl
    pageurl = _parse_pageurl(homepage_response)
    return [pagetype, pagename, identifier, likes, creation_time, category, pageurl]


if __name__ == '__main__':
    # pageurls
    pageurl = 'https://www.facebook.com/mohw.gov.tw'
    pageurl = 'https://www.facebook.com/groups/pythontw'
    pageurl = 'https://www.facebook.com/Gooaye'
    pageurl = 'https://www.facebook.com/emily0806'
    pageurl = 'https://www.facebook.com/anuetw/'
    pageurl = 'https://www.facebook.com/wealtholic/'
    pageurl = 'https://www.facebook.com/hatendhu'

    headers = _get_headers(pageurl)
    headers['Referer'] = 'https://www.facebook.com/hatendhu'
    headers['Origin'] = 'https://www.facebook.com'
    headers['Cookie'] = 'dpr=1.5; datr=rzIwY5yARwMzcR9H2GyqId_l'
    homepage_response = _get_homepage(pageurl=pageurl, headers=headers)
    entryPoint = _parse_entryPoint(homepage_response)
    print(entryPoint)
    identifier = _parse_identifier(entryPoint, homepage_response)
    docid = _parse_docid(entryPoint, homepage_response)
    df, cursor, max_date, break_times = _init_request_vars(cursor='')
    cursor = 'AQHRlIMW9sczmHGnME47XeSdDNj6Jk9EcBOMlyxBdMNbZHM7dwd0rn8wsaxQxeXUsuhKVaMgVwPHb9YS9468INvb5yw2osoEmXd_sMXvj8rLhmBxeaJucMSPIDux_JuiHToC'
    cursor = 'AQHRxSZTqUvlLpkXCnrOjdX0gZeyn-Q1cuJzn4SPJuZ5rkYi7nZFByE5pwy4AsBoUOtcmF28lNfXR_rqv7oO7545iURm_mx46aZLBDiYfPmgI2mjscHUTiVi5vv1vj5EXiF4'
    resp = _get_posts(headers=headers, identifier=identifier,
                      entryPoint=entryPoint, docid=docid, cursor=cursor)

    # graphql
    edges = _parse_edgelist(resp)
    print(len(edges))
    _parse_edge(edges[0])
    edges[0].keys()
    edges[0]['node'].keys()
    edges[0]['node']['comet_sections'].keys()
    edges[0]['node']['comet_sections']
    df, max_date, cursor = _parse_composite_graphql(resp)
    df

    # nojs
    content_list, cursor = _parse_domops(resp)
    df, max_date, cursor = _parse_composite_nojs(resp)

    # page paser
    pagename = _parse_pagename(homepage_response).encode('utf-8').decode()
    likes = _parse_likes(homepage_response, entryPoint, headers)
    creation_time = _parse_creation_time(
        homepage_response=homepage_response, entryPoint=entryPoint, headers=headers)
    category = _parse_category(homepage_response, entryPoint, headers)
    pageurl = _parse_pageurl(homepage_response)


================================================
FILE: requester.py
================================================
import re
import requests
import time

from utils import _init_request_vars


def _get_headers(pageurl):
    '''
    Send a request to get cookieid as headers.
    '''
    # Hit the mobile site first; it hands back the cookies an anonymous session needs.
    pageurl = re.sub('www', 'm', pageurl)
    resp = requests.get(pageurl)
    headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
               'accept-language': 'en'}
    headers['cookie'] = '; '.join(['{}={}'.format(cookieid, resp.cookies.get_dict()[cookieid])
                                   for cookieid in resp.cookies.get_dict()])
    # headers['cookie'] = headers['cookie'] + '; locale=en_US'
    return headers


def _get_homepage(pageurl, headers):
    '''
    Send a request to get the homepage response
    '''
    pageurl = re.sub('/$', '', pageurl)
    timeout_cnt = 0
    while True:
        try:
            homepage_response = requests.get(
                pageurl, headers=headers, timeout=3)
            return homepage_response
        except:
            time.sleep(5)
            timeout_cnt = timeout_cnt + 1
            if timeout_cnt > 20:
                # After 20 failed attempts, return a stand-in object whose .text
                # mimics Facebook's error page so downstream parsing can bail out.
                class homepage_response():
                    text = 'Sorry, something went wrong.'
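`_get_headers` above bootstraps an anonymous session: it fetches the mobile site once, then folds the cookies from that response into the single `cookie` header that later requests carry. A minimal offline sketch of just the folding step (the cookie names and values here are made up for illustration):

```python
# Sketch of the cookie-joining step used by _get_headers.
# The dict stands in for requests' resp.cookies.get_dict(); values are made up.
cookies = {'datr': 'rzIwY5yARwMzcR9H2GyqId_l', 'dpr': '1.5'}

# Fold the cookie jar into the single "cookie" request header later calls send.
cookie_header = '; '.join('{}={}'.format(name, value)
                          for name, value in cookies.items())

print(cookie_header)  # datr=rzIwY5yARwMzcR9H2GyqId_l; dpr=1.5
```

The same string could be built with `requests.utils` helpers, but a plain join keeps the dependency surface identical to the original function.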
                return homepage_response


def _get_pageabout(homepage_response, entryPoint, headers):
    '''
    Send a request to get the about page response
    '''
    pageurl = re.sub('/$', '', homepage_response.url)
    pageabout = requests.get(pageurl + '/about', headers=headers)
    return pageabout


def _get_pagetransparency(homepage_response, entryPoint, headers):
    '''
    Send a request to get the transparency page response
    '''
    pageurl = re.sub('/$', '', homepage_response.url)
    if entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint']:
        transparency_response = requests.get(
            pageurl + '/about_profile_transparency', headers=headers)
        return transparency_response


def _get_posts(headers, identifier, entryPoint, docid, cursor):
    '''
    Send a request to get new posts from fanspage/group.
    '''
    if entryPoint in ['nojs']:
        # Legacy endpoint for pages served without the Comet frontend.
        params = {'page_id': identifier,
                  'cursor': str({"timeline_cursor": cursor,
                                 "timeline_section_cursor": '{}',
                                 "has_next_page": 'true'}),
                  'surface': 'www_pages_posts',
                  'unit_count': 10,
                  '__a': '1'}
        resp = requests.get(
            url='https://www.facebook.com/pages_reaction_units/more/', params=params)
    else:
        # entryPoint in ['CometSinglePageHomeRoot.entrypoint',
        #                'ProfilePlusCometLoggedOutRouteRoot.entrypoint',
        #                'CometGroupDiscussionRoot.entrypoint']
        data = {'variables': str({'cursor': cursor, 'id': identifier, 'count': 3}),
                'doc_id': docid}
        resp = requests.post(
            url='https://www.facebook.com/api/graphql/', data=data, headers=headers)
    return resp


if __name__ == '__main__':
    pageurl = 'https://www.facebook.com/ec.ltn.tw/'
    pageurl = 'https://www.facebook.com/Gooaye'
    pageurl = 'https://www.facebook.com/groups/pythontw'
    pageurl = 'https://www.facebook.com/hatendhu'

    headers = _get_headers(pageurl)
    homepage_response = _get_homepage(pageurl=pageurl, headers=headers)
    df, cursor, max_date, break_times = _init_request_vars()
    cursor = 'AQHRlIMW9sczmHGnME47XeSdDNj6Jk9EcBOMlyxBdMNbZHM7dwd0rn8wsaxQxeXUsuhKVaMgVwPHb9YS9468INvb5yw2osoEmXd_sMXvj8rLhmBxeaJucMSPIDux_JuiHToC'
    cursor = 'AQHRixL5fPMA_nM-78jGg4LohG3M4a2-YQR6WSaWOTiqPRJ1dOGchYRzp1wdDtusNd-5FkCPXwByL_kZM2iyLIz1XHB8WIEzHYXTU3vQzviOI9GexNv__RPn1xnFJZddnjX3'

    from paser import _parse_entryPoint, _parse_identifier, _parse_docid, _parse_composite_graphql
    entryPoint = _parse_entryPoint(homepage_response)
    identifier = _parse_identifier(entryPoint, homepage_response)
    docid = _parse_docid(entryPoint, homepage_response)
    df, cursor, max_date, break_times = _init_request_vars(cursor='')
    resp = _get_posts(headers=headers, identifier=identifier,
                      entryPoint=entryPoint, docid=docid, cursor=cursor)
    ndf, max_date, cursor = _parse_composite_graphql(resp)
    resp.json()
    ndf
    max_date
    cursor
    # print(len(resp.text))


================================================
FILE: requirements.txt
================================================
requests==2.24.0
bs4==0.0.1
pandas==1.2.4
numpy==1.20.3
dicttoxml==1.7.4
lxml==4.9.2


================================================
FILE: sample/20221013_Sample.ipynb
================================================
[Notebook output, condensed from the original escaped JSON. A single code cell,
run on 2022-10-13, crawled a fanpage and printed roughly forty progress lines as
the cursor walked backwards in time, from
"The maximum date of these posts is: 2022-10-13 01:06:00, keep crawling..." down to
"The maximum date of these posts is: 2022-09-30 00:28:37, keep crawling..."]
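The progress lines above come from a cursor-driven crawl: each round trip returns a batch of posts plus a cursor for the next page, and the loop stops once the newest post in a batch is already older than the target date. A minimal sketch of that pattern with a stubbed fetch function (the names `fetch`, `crawl`, and `BATCHES` are illustrative, not the package's API):

```python
from datetime import date

# Stub standing in for one _get_posts + parse round trip. Each entry is
# (posts, max_date_in_batch, next_cursor); the cursor is just an index here.
BATCHES = [
    (['post-a', 'post-b'], date(2022, 10, 13), 1),
    (['post-c'],           date(2022, 10, 6),  2),
    (['post-d'],           date(2022, 9, 30),  None),
]

def fetch(cursor):
    return BATCHES[cursor]

def crawl(until_date):
    posts, cursor = [], 0
    while cursor is not None:
        batch, max_date, cursor = fetch(cursor)
        posts.extend(batch)
        if max_date <= until_date:  # newest post already past the target: stop
            break
    return posts

print(crawl(date(2022, 10, 1)))  # ['post-a', 'post-b', 'post-c', 'post-d']
```

The real crawler adds retry counters and resumable cursors on top of this loop, but the stop condition is the same date guard shown here.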
[DataFrame output: 117 rows × 17 columns. Columns: NAME, PAGEID, POSTID, TIME,
MESSAGE, ATTACHMENT_TITLE, ATTACHMENT_DESCRIPTION, ATTACHMENT_PHOTOS,
REACTIONCOUNT, COMMENTCOUNT, DISPLAYCOMMENTCOUNT, SHARECOUNT, REACTIONS, CURSOR,
ACTOR_URL, POST_URL, UPDATETIME. Posts from the fanpage 黑特東華 NDHU Hate
(PAGEID 100064708507691) spanning 2022-09-30 to 2022-10-13, crawled at
2022-10-13 22:02:32; the rendered HTML table is omitted here.]
[DataFrame output: 80 rows × 18 columns. Columns: NAME, TIME, MESSAGE, LINK,
PAGEID, POSTID, COMMENTCOUNT, REACTIONCOUNT, SHARECOUNT, DISPLAYCOMMENTCOUNT,
ANGER, HAHA, LIKE, LOVE, SORRY, SUPPORT, WOW, UPDATETIME. Posts from the fanpage
丟丟妹 (PAGEID 1723714034327589) spanning 2020-04-15 to 2021-12-03, crawled at
2022-01-03 12:42:18; the rendered HTML table is omitted here.]
[DataFrame output: 480 rows × 20 columns. Columns: NAME, TIME, MESSAGE, LINK,
PAGEID, POSTID, COMMENTCOUNT, REACTIONCOUNT, SHARECOUNT, DISPLAYCOMMENTCOUNT,
ANGER, DOROTHY, HAHA, LIKE, LOVE, SORRY, SUPPORT, TOTO, WOW, UPDATETIME. Posts
from the fanpage 馬英九 (PAGEID 118250504903757) spanning 2015-10-09 to
2022-01-01, crawled at 2022-01-05 23:16:30; the rendered HTML table is omitted
here.]
[DataFrame output: 3940 rows × 10 columns. Columns: ACTORID, NAME, GROUPID,
POSTID, TIME, CONTENT, COMMENTCOUNT, SHARECOUNT, LIKECOUNT, UPDATETIME. Posts
from a public group (GROUPID 197223143437) spanning 2020-12-19 to 2022-01-05,
crawled at 2022-01-05 22:47:28; the rendered HTML table is omitted here.]
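The fanpage tables above expand each post's raw REACTIONS list into per-type count columns (ANGER, HAHA, LIKE, LOVE, SORRY, SUPPORT, WOW), with zeros where a type is absent. A plain-Python sketch of that expansion; note that only `reaction_count` and `node` appear in the sample output, so the `reaction_type` field name inside `node` is a guess used here for illustration:

```python
# Expand a REACTIONS-style list into one count per reaction type.
# 'reaction_type' is a hypothetical field name; the real payload nests the
# type information under 'node' in a shape not fully visible in the samples.
def expand_reactions(reactions,
                     all_types=('ANGER', 'HAHA', 'LIKE', 'LOVE',
                                'SORRY', 'SUPPORT', 'WOW')):
    counts = {t: 0 for t in all_types}  # absent types default to zero
    for entry in reactions:
        name = entry['node']['reaction_type']
        if name in counts:
            counts[name] = entry['reaction_count']
    return counts

row = expand_reactions([
    {'reaction_count': 138, 'node': {'reaction_type': 'LIKE'}},
    {'reaction_count': 25,  'node': {'reaction_type': 'HAHA'}},
])
print(row['LIKE'], row['HAHA'], row['ANGER'])  # 138 25 0
```

Applied row by row (for example via `DataFrame.apply`), this yields exactly the per-type columns shown in the fanpage samples.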