<?xml version='1.0' encoding='utf-8'?>
<eprints xmlns='http://eprints.org/ep2/data/2.0'>
  <eprint id='https://researchdata.bbk.ac.uk/id/eprint/154'>
    <eprintid>154</eprintid>
    <rev_number>17</rev_number>
    <documents>
      <document id='https://researchdata.bbk.ac.uk/id/document/667'>
        <docid>667</docid>
        <rev_number>5</rev_number>
        <files>
          <file id='https://researchdata.bbk.ac.uk/id/file/2119'>
            <fileid>2119</fileid>
            <datasetid>document</datasetid>
            <objectid>667</objectid>
            <filename>4chan-data-april-to-june.json</filename>
            <mime_type>text/plain</mime_type>
            <hash>29efec7332d447bcd25ec7ee22791bbf</hash>
            <hash_type>MD5</hash_type>
            <filesize>3927843935</filesize>
            <mtime>2021-07-16 22:10:43</mtime>
            <url>https://researchdata.bbk.ac.uk/id/eprint/154/1/4chan-data-april-to-june.json</url>
          </file>
        </files>
        <eprintid>154</eprintid>
        <pos>1</pos>
        <placement>1</placement>
        <mime_type>text/plain</mime_type>
        <format>text</format>
        <formatdesc>4chan dataset from 1st of April 2021 to 1st of July 2021</formatdesc>
        <language>en</language>
        <security>public</security>
        <license>odc_by</license>
        <main>4chan-data-april-to-june.json</main>
        <content>data</content>
        <prog_language>JSON</prog_language>
      </document>
    </documents>
    <eprint_status>archive</eprint_status>
    <userid>102</userid>
    <dir>disk0/00/00/01/54</dir>
    <datestamp>2021-07-29 16:05:41</datestamp>
    <lastmod>2026-04-14 11:12:20</lastmod>
    <status_changed>2021-07-29 16:05:41</status_changed>
    <type>data_collection</type>
    <metadata_visibility>show</metadata_visibility>
    <creators>
      <item>
        <name>
          <family>Prifti</family>
          <given>Ylli</given>
        </name>
      </item>
    </creators>
    <title>4chan /pol board as a temporary evolution of live threads and posts.</title>
    <subjects>
      <item>CMS</item>
    </subjects>
    <divisions>
      <item>csis</item>
    </divisions>
    <full_text_status>public</full_text_status>
    <keywords>4chan, live board, threads, posts</keywords>
    <abstract>We monitored the live /pol board on 4chan and scraped each thread multiple times while it remained on the board. We also scraped the archive status of each thread, as exposed by the 4chan API and the 4plebs (internet archiving service) API; that archive data is not included in this dataset.

This dataset contains a three-month extract of the data scraped from 4chan between 1st of April 2021 and 1st of July 2021.</abstract>
    <date>2021-07-16</date>
    <publisher>Birkbeck College, University of London</publisher>
    <id_number>10.18743/DATA.00154</id_number>
    <funders>
      <item>
        <funders>no_funder</funders>
      </item>
    </funders>
    <agreement>
      <item>yes</item>
    </agreement>
    <ret_info>
      <item>
        <ret_date>2031-07-29</ret_date>
      </item>
    </ret_info>
    <research_centre>bida</research_centre>
    <record_type>metadata_and_data_files</record_type>
    <collection_method>We collected this information with a distributed scraping system, using a combination of 14 nodes, 3 clusters and 100 running instances per minute to scrape the data at high frequency, because content on 4chan is ephemeral.

The scraping system was configured as a breadth-first scraper, and MongoDB was used as the document store, configured as a replica set of one primary, three secondaries and one arbiter.</collection_method>
    <provenance>Extracted as JSON from the MongoDB source</provenance>
    <legal_ethical>This is a collection of publicly available, anonymous at source and otherwise ephemeral data.</legal_ethical>
    <collection_date>
      <date_from>2021-04-01</date_from>
      <date_to>2021-07-01</date_to>
    </collection_date>
    <temporal_cover>
      <date_from>2021-04-01</date_from>
      <date_to>2021-07-01</date_to>
    </temporal_cover>
  </eprint>
</eprints>
