---
tags: disinfo
---

# 零時資訊傳播資料標準 0archive Data Standard

## Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Glossary

* Producer
* Publication
* Author
* Dataset compiler

> All "identifier" fields in the following sections may be "localized" identifiers for the entity, issued by the publisher of the dataset. They can be anonymized through reindexing or hashing. Such a localized identifier should be prefixed with an identifier of the publisher, following an [org-id](http://org-id.guide/)-like practice.

## Classes and properties

### Producer

The Producer class should have the following properties:

* **id**: a unique identifier within the scope of this dataset for the content producer.
* **name**: the name of the company, organization, discussion board, social network profile, or other kind of information channel that produces content.
* **alternate names**: shorter names, abbreviations, and names of the producer in other languages that are in common use.
* **identifiers**: ways to identify the producer, such as its legal entity if there is one, its tracking IDs used in services such as Google Analytics, or others.
* **description**: a short description of the content producer given by the dataset compiler.
* **classification**: the classification of the content producer given by the dataset compiler.
* **url**: a URL to the web site, online news outlet, discussion board, or social network profile where the content made by the producer is published.
* **languages**: languages primarily used to produce the content.
* **licenses**: licenses primarily used on the content.
* **date of first seen**: the publishing date and time of the earliest content by the producer included in the dataset.
* **date of last update**: the publishing date and time of the newest content by the producer included in the dataset.
* **followership**: number of subscribers, followers, users who "like" the channel, or other information about the followership of the producer.

#### JSON Serialization

* The name `other_names` is used instead of alternate names. Its value is a list of name strings.
* The value of the `identifiers` property is a list of identifier objects. An identifier object has `scheme` and `identifier` properties. The dataset compiler is responsible for specifying the identifier objects it uses in the dataset.
* The name `first_seen_at` is used instead of date of first seen.
* The name `last_update_at` is used instead of date of last update.
* The value of the `followership` property is a followership object. The dataset compiler is responsible for specifying the format of followership objects it uses in the dataset.

#### JSON Schema

TODO

#### References

* Ronny's newsdiff [data](https://github.com/ronnywang/newsdiff/blob/master/webdata/stdlibs/url-normalizer.js/map.csv)
* Gugod's people-in-news [news-sites.txt](https://github.com/g0v/people-in-news/blob/master/etc/news-sites.txt)
* 零時檔案局 [Sites](https://airtable.com/shrd0utGHlTWmQsYt)
* Schema.org: [Organization](https://schema.org/Organization), [Person](https://schema.org/Person), [Project](https://schema.org/Project)
* "Account information" in [Twitter datasets](https://transparency.twitter.com/en/information-operations.html)
* Popolo standard: [Organization](https://www.popoloproject.com/specs/organization.html)
* Open Contracting Data Standard [Organization](https://standard.open-contracting.org/latest/en/schema/reference/#organization)
* "[Followership and social media marketing](https://www.researchgate.net/publication/282685486_Followership_and_social_media_marketing)", "[Social Media Followership as a Predictor of News Website Traffic](https://www.tandfonline.com/doi/abs/10.1080/17512786.2019.1635040?journalCode=rjop20)"

### Publication

The Publication class should have the following properties:

* **id**: a unique identifier within the scope of this dataset for the publication.
* **version**: version of this copy of the publication.
* **identifiers**: ways to identify the publication, such as the ID used by popular content databases.
* **producer_id**: identifier of the producer that made the publication.
* **canonical_url**: a URL to the publication, normalized to be a unique identifier within the scope of this dataset.
* **title**: title of the publication.
* **text**: text of the publication.
* **author**: author of the content of the publication.
* **about**: the subject this publication replies to, as with replies on Twitter.
* **language**: language used in the text of the publication.
* **license**: license of the publication.
* **date of publication**: the date and time when the content was published.
* **date of first seen**: the date and time when the publication was first observed by the dataset compiler.
* **date of last update**: the date and time when the latest update to the content was observed by the dataset compiler.
* **urls**: URLs that appear in the text of the publication, excluding the ones pointing to the publication itself.
* **hashtags**: hashtags used in the content, as in Twitter tweets and Facebook posts.
* **mentions**: mentions of other users made in the content, as in Twitter tweets and Facebook posts.
* **keywords**: keywords of the content as given by the producer in meta tags or in the publication.
* **reactions**: numbers of reader reactions, such as Facebook "like" or other emotions, Twitter "like", Medium "claps", or user views.
* **comments**: comments on the publication, such as PTT and Facebook replies.
* **internet connection**: information about the connection used to publish this content, such as the IP address on PTT articles or the geolocation of Twitter tweets.

#### JSON Serialization

* The value of the `identifiers` property is a list of identifier objects. An identifier object has `scheme` and `identifier` properties. The dataset compiler is responsible for specifying the identifier objects it uses in the dataset.
* The name `publication_text` is used instead of text.
* The name `reply_to` is used instead of about.
* The name `published_at` is used instead of date of publication.
* The name `first_seen_at` is used instead of date of first seen.
* The name `last_update_at` is used instead of date of last update.
* The value of the `reactions` property is a reactions object. The dataset compiler is responsible for specifying the format of reactions objects it uses in the dataset.
* The value of the `comments` property is a list of comment objects.
* The name `connect_from` is used instead of internet connection.

#### JSON Schema

TODO

#### References

* Schema.org: [Article](https://schema.org/Article), [WebPage](https://schema.org/WebPage), [Message](https://schema.org/Message), [SocialMediaPosting](https://schema.org/SocialMediaPosting)
* "articles" in [cofacts opendata](https://github.com/cofacts/opendata)
* "Tweet information" in [Twitter datasets](https://transparency.twitter.com/en/information-operations.html)
* Popolo standard: [Speech](https://www.popoloproject.com/specs/speech.html), [Motion](https://www.popoloproject.com/specs/motion.html)
* Gugod's people-in-news [Article.pm](https://github.com/g0v/people-in-news/blob/master/lib/Sn/Article.pm), [NewsExtractor::Article](https://github.com/perltaiwan/NewsExtractor/blob/master/lib/NewsExtractor/Article.pm)

### Comment

The Comment class should have the following properties:

* **id**: a unique identifier within the scope of all comments about the same subject.
* **author**: author of the comment.
* **text**: text of the comment.
* **date of publication**: the date and time when the comment was published.
* **about**: the subject this comment is in reply to.
* **reactions**: numbers of reader reactions, such as Facebook "like" or other emotions, upvotes, or downvotes.
* **internet connection**: information about the connection used to publish this content, such as the IP address on PTT comments.

#### JSON Serialization

* The name `comment_text` is used instead of text.
* The name `published_at` is used instead of date of publication.
* The name `reply_to` is used instead of about.
* If the subject of this comment is another comment, the value of the `reply_to` property is the `id` of that comment. If the subject of this comment is the publication it belongs to, the `reply_to` value may be omitted.
* The value of the `reactions` property is a reactions object. The dataset compiler is responsible for specifying the format of reactions objects it uses in the dataset.
* The name `connect_from` is used instead of internet connection.

#### JSON Schema

TODO

## Specifications by 0archive

### Publication

* `version` is the UNIX timestamp of the time when the copy of the publication was archived.

## Serialization and packaging

The following serialization scheme is supported:

* JSON Lines

Datasets are packaged in the following format:

* Frictionless Data [Data Package](https://frictionlessdata.io/specs/data-package/)

## Similar projects

* [The OpenArchive Project](https://open-archive.org/)
* [Civil Archive](https://grants.g0v.tw/projects/586a7be0a327a4001ee49126) in g0v grants
* [WARC](https://en.wikipedia.org/wiki/Web_ARChive)

# Work in progress

The following are pending features of the standard that may or may not be included in the final draft.
## Classes and properties

### Media

* identifier
* canonical_url
* author
* title
* description
* location
* content_type
* content_length
* content
* content_hash
* original_filename
* first_seen_at
* last_updated_at
* tags

#### References

* [Open Archive "SAVE Space" Specification](https://github.com/OpenArchive/Save-app-android/blob/master/docs/OpenArchiveSpaceCapsuleSpec.md)
* Schema.org [MediaObject](https://schema.org/MediaObject)

### Claim

#### References

* Schema.org [Claim](https://schema.org/Claim)

### ClaimReview

* summary
* canonical_url
* item_reviewed
* review_body
* review_rating
* references

#### References

* Schema.org [ClaimReview](https://schema.org/ClaimReview)
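
# Example (non-normative)

Since the JSON Schema sections are still marked TODO, the following is only a minimal sketch of how the JSON Serialization rules for Producer, Publication, and Comment might be applied and written out as JSON Lines. Every value is hypothetical, as are the `XX-EXAMPLE` identifier prefix, the `example-scheme` identifier scheme, and the shapes of the `followership` and `reactions` objects, which the standard leaves to the dataset compiler. Storing all dates as UNIX timestamps is also an assumption: the standard only fixes that representation for the 0archive `version` property.

```python
import json

# Hypothetical publisher prefix, following the org-id-like practice.
PUBLISHER_PREFIX = "XX-EXAMPLE"

# Producer record using the serialized property names.
producer = {
    "id": f"{PUBLISHER_PREFIX}-producer-1",
    "name": "Example News",
    "other_names": ["ExNews"],  # serialized name for "alternate names"
    "identifiers": [{"scheme": "example-scheme", "identifier": "12345"}],
    "url": "https://news.example.org/",
    "languages": ["zh-TW"],
    "first_seen_at": 1577836800,  # assumed UNIX timestamp
    "last_update_at": 1580515200,
    "followership": {"followers": 1000},  # format chosen by the compiler
}

# Publication record with two nested comments.
publication = {
    "id": f"{PUBLISHER_PREFIX}-publication-1",
    "version": 1580515200,  # 0archive: UNIX timestamp of archiving time
    "producer_id": producer["id"],
    "canonical_url": "https://news.example.org/articles/1",
    "title": "Example title",
    "publication_text": "Example body text.",  # serialized name for "text"
    "published_at": 1580511600,
    "first_seen_at": 1580515200,
    "last_update_at": 1580515200,
    "reactions": {"likes": 3},  # format chosen by the compiler
    "comments": [
        # Replies to the publication itself, so reply_to is omitted.
        {"id": "c1", "author": "reader_a",
         "comment_text": "First comment.", "published_at": 1580512000},
        # Replies to another comment, so reply_to holds that comment's id.
        {"id": "c2", "author": "reader_b",
         "comment_text": "A reply.", "published_at": 1580513000,
         "reply_to": "c1"},
    ],
}


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def from_json_lines(text):
    """Parse JSON Lines text back into a list of records."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


producers_jsonl = to_json_lines([producer])
publications_jsonl = to_json_lines([publication])
```

Keeping one record per line lets a consumer stream a large dataset without loading it whole. Nesting comments inside their publication is one reading of "a list of comment objects"; a compiler could document a different layout.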
