Build a production-ready Node.js API library that provides image indexing and search capability, along with documentation on how to use and deploy it.
It supports
A TypeScript library developed using tsdx and published to npm.
cofacts/media-manager
@cofacts/media-manager
The manager will act as an interface to the underlying storage, providing simple search (listing / indexing), get and insert functionality.
Expected Usage:
import MediaManager from '@cofacts/media-manager';
// Setup
const manager = new MediaManager({
credentialsJSON: process.env.GCS_CREDENTIALS,
bucketName: process.env.GCS_BUCKET_NAME,
});
// Search
const { hits } = await manager.query({url: 'https://......'});
// Upload and index
const { id, url } = await manager.insert({url: 'https://......'});
MediaManager
class MediaManager {
constructor(params: {
/** Google Cloud credential JSON content of a service account. */
credentialsJSON: string;
/**
* Existing GCS bucket. The service account of `credentialsJSON` needs to
* have the following permission of this bucket:
* - roles/storage.objectCreator
* - roles/storage.objectViewer
*/
bucketName: string;
/**
* The prefix to write media files.
* File structure after this prefix is managed by MediaManager
*/
prefix?: string;
}) {}
// The GCS Bucket object
#bucket: Bucket;
query(params: {url: string}): Promise<SearchResult> {}
insert(params: {url: string}): Promise<InsertResult> {}
// Get file by ID from GCS
getContent(id: string): ReadableStream {}
// Get file info by ID from GCS. Null if specified ID does not exist.
getInfo(id: string): FileInfo | null {}
}
#query({url: string}): Promise<SearchResult>
It can return multiple search hits for images, and one hit (exact match) for videos, audio or other formats.
The reason why video and audio can only have exact-match results:
interface SearchResult {
/** metadata for the queried file */
queryInfo: QueryInfo;
hits: SearchHit[];
}
interface SearchHit {
/** Similarity between 0 and 1 */
similarity: number;
/** Metadata for the file */
info: FileInfo;
}
enum MediaType {
IMAGE = 'image',
AUDIO = 'audio',
VIDEO = 'video',
FILE = 'file',
}
interface FileInfo {
/**
* The unique ID for the file.
* The ID is considered opaque;
* Applications should not try to decipher this ID. */
id: string;
/** Public URL */
url: string;
type: MediaType;
// Extension from https://www.npmjs.com/package/mime
ext: string;
// MIME string
mime: string;
createdAt: Date;
}
/** `id` is the ID the file would have if it were inserted into the database. */
type QueryInfo = Pick<FileInfo, 'id' | 'type' | 'ext' | 'mime'>;
The promise may reject with the following errors:
/** Cannot download file from the specified URL */
class DownloadError {
// Error thrown by node-fetch if applicable
fetchError: Error;
}
#insert({url: string, onUploadStop?: (Error | null) => void}): Promise<FileInfo>
This method uploads the file at the given url to GCS. Files with identical content, or images with near-duplicate content, produce the same file fingerprint or perceptual hash, so there will be no duplicates on GCS.
insert() resolves as soon as all data in FileInfo is resolved. Among all fields in FileInfo, id should be the slowest to retrieve (url is resolved along with id).
By the time insert() resolves, the file upload to GCS may still be in progress. When the upload succeeds, onUploadStop(null) will be called. If the upload fails, onUploadStop(err) will be called, passing the err returned by the GCS Node.js API.
query()
query() with this result
SEARCH_HASH and ID_HASH
SEARCH_HASH is calculated, search GCS using the search
ID_HASH and each file’s ID_HASH
insert()
f with temporary name
f
onUploadStop
If <file_type>/<hash> does not exist on GCS, rename f to <file_type>/<hash>; otherwise, delete f
insert()
If image/<SEARCH_HASH>/<ID_HASH> does not exist on GCS, rename f to image/<SEARCH_HASH>/<ID_HASH>; otherwise, delete f
insert()
f
Not necessarily required.
The maximum size of a file that a user can send to LINE is a bit vague. It is said that the file can be as large as 1GB (for the desktop version).
https://tw.imyfone.com/line/line-sends-files-size-limit/
It is said that files over 20MB cannot be retrieved by get content API.
https://www.line-community.me/en/question/5e306351851f7402cd95223f/get-file-content-of-messaging-api-larger-than-20mb?loginnow=true
After decoding into raw data, a file can become so large that fitting all of it into memory is not recommended.
We can use the following strategies to avoid processing big files in memory:
LINE already downsizes images and videos by limiting their dimensions. Among all images we previously collected, image files very rarely exceed 2MB.
As for file upload, it can be more complex because if we cannot buffer whole file into memory, we cannot get file hash before uploading the whole file. To keep the response stream flowing, we must calculate hash and upload to GCS at the same time. After hash is calculated, we rename the file on GCS if the file previously did not exist.
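The hash-while-uploading flow described above can be sketched as follows. This is only an illustration under stated assumptions: `Store` and `MemoryStore` are hypothetical stand-ins for a GCS bucket wrapper, `uploadWithHash` is not part of the real API, and SHA-256 is used as the hash. For simplicity the sketch waits for the upload to finish before renaming.

```typescript
import { createHash } from 'node:crypto';
import { Readable, Transform, Writable } from 'node:stream';
import { pipeline } from 'node:stream/promises';

/** Hypothetical minimal storage interface; a GCS wrapper would implement it. */
interface Store {
  createWriteStream(name: string): Writable;
  exists(name: string): Promise<boolean>;
  rename(from: string, to: string): Promise<void>;
  remove(name: string): Promise<void>;
}

/** In-memory stand-in for a bucket, only here to make the sketch runnable. */
class MemoryStore implements Store {
  files = new Map<string, Buffer>();

  createWriteStream(name: string): Writable {
    const chunks: Buffer[] = [];
    return new Writable({
      write: (chunk: Buffer, _encoding, done) => {
        chunks.push(chunk);
        done();
      },
      // Commit the object only once the whole stream has been written
      final: done => {
        this.files.set(name, Buffer.concat(chunks));
        done();
      },
    });
  }
  async exists(name: string) {
    return this.files.has(name);
  }
  async rename(from: string, to: string) {
    const data = this.files.get(from);
    if (data === undefined) throw new Error(`No such file: ${from}`);
    this.files.set(to, data);
    this.files.delete(from);
  }
  async remove(name: string) {
    this.files.delete(name);
  }
}

/** Streams `body` to a temporary name while hashing it, then renames or discards it. */
async function uploadWithHash(
  body: Readable,
  store: Store,
  tempName: string,
  fileType = 'file'
): Promise<string> {
  const hash = createHash('sha256');
  // Pass-through transform: every chunk updates the hash on its way to storage
  const hasher = new Transform({
    transform(chunk, _encoding, done) {
      hash.update(chunk);
      done(null, chunk);
    },
  });
  await pipeline(body, hasher, store.createWriteStream(tempName));

  const finalName = `${fileType}/${hash.digest('hex')}`;
  if (await store.exists(finalName)) {
    await store.remove(tempName); // duplicate content: drop the temporary copy
  } else {
    await store.rename(tempName, finalName);
  }
  return finalName;
}
```

Because no chunk is buffered beyond what the streams hold internally, memory usage stays roughly constant regardless of file size.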
LINE file test
212MB --> Cannot upload QQ
Video cofacts_Final_TW.mp4
It seems that it is possible for LINE to have video files over 50MB. We will
LINE’s getContent API provides a content-type header. This gives us a way to plan the processing route before reading any byte from the body stream.
We can parse the content-type header and use its type to determine MediaType accordingly:
image/* to MediaType.IMAGE
video/* to MediaType.VIDEO
audio/* to MediaType.AUDIO
anything else to MediaType.FILE
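A minimal sketch of this mapping is shown below. The function name `mediaTypeFromContentType` is hypothetical, and treating a missing or unrecognized header as MediaType.FILE is an assumption.

```typescript
enum MediaType {
  IMAGE = 'image',
  AUDIO = 'audio',
  VIDEO = 'video',
  FILE = 'file',
}

/** Maps a content-type header value to a MediaType using only its top-level type. */
function mediaTypeFromContentType(contentType: string | null | undefined): MediaType {
  // Drop parameters such as "; charset=binary", then keep the part before "/"
  const topLevel = (contentType ?? '').split(';')[0].trim().toLowerCase().split('/')[0];
  switch (topLevel) {
    case 'image':
      return MediaType.IMAGE;
    case 'video':
      return MediaType.VIDEO;
    case 'audio':
      return MediaType.AUDIO;
    default:
      return MediaType.FILE;
  }
}
```

The header is available before the body is consumed, so the result can select the stream pipeline up front.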
As an alternative, we can also use the file-type package:
Asserting the type from the file content can make the media manager more robust.
Since we use a different stream pipeline for each type of data received, using file-type is more complicated than using the response header directly.
For the sake of simplicity, we use the HTTP response’s content-type directly.
bits=16 (256 bits in length) hash as dedup ID (ID_HASH)
bits=6 (36 bits in length) to search (SEARCH_HASH) - see hash performance here
<prefix>/image/<SEARCH_HASH>/<ID_HASH>
<prefix>/image/SEARCH_HASH prefix as search hits
ID_HASH and the search hits’ ID_HASH.
buffer-xor and table lookup to get the Hamming distance, then return the proportion of identical bits as similarity.
id in FileInfo can be the file path image/<ID_HASH>, as it must uniquely identify a media file (videos and audios included).
Use Elasticsearch to index images and perform search instead.
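For comparison, the XOR-and-table-lookup similarity used by the GCS listing approach can be sketched as below. This is an illustration rather than the actual implementation: it XORs bytes directly instead of using the buffer-xor package, and the names `POPCOUNT` and `hashSimilarity` are hypothetical.

```typescript
// Lookup table: POPCOUNT[b] is the number of 1 bits in the byte value b
const POPCOUNT = new Uint8Array(256);
for (let b = 1; b < 256; b++) {
  POPCOUNT[b] = POPCOUNT[b >> 1] + (b & 1);
}

/**
 * Returns the proportion of identical bits between two equal-length hashes,
 * a number between 0 and 1. Buffers can be passed since Buffer extends Uint8Array.
 */
function hashSimilarity(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Hashes must be non-empty and of equal length');
  }
  let distance = 0; // Hamming distance: number of differing bits
  for (let i = 0; i < a.length; i++) {
    distance += POPCOUNT[a[i] ^ b[i]];
  }
  const totalBits = a.length * 8;
  return (totalBits - distance) / totalBits;
}
```

Sorting search hits by this value in descending order puts the closest matches first.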
MediaManager takes an Elasticsearch client and an index name (the MediaEntry index).
When MediaManager initializes, it checks if the MediaEntry index exists. If not, it tries to create the index.
MediaEntry:
<ID_HASH>
b1 ~ b4, 64-bit long ints, each holding part of ID_HASH. This is for sorting the search results.
SEARCH_HASH.
If fuzzy query is used, we can store just 1 field, the SEARCH_HASH.
Ibs and Obs, recording the indices of 1 and 0. The lengths of Ibs and Obs sum up to the length of SEARCH_HASH (36).
SEARCH_HASH (each 12 bits in length).
ID_HASH for image file’s GCS path.
SEARCH_HASH or indexing methods
All features come from Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines.
We directly perform a fuzzy query on the SEARCH_HASH itself.
The SEARCH_HASH can be stored in binary form (a term with 36 characters, each either ‘0’ or ‘1’), or in other forms (such as base-4 strings, allowing up to 2-bit distance per differing character).
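A rough sketch of such a request body for the binary form is shown below. The field name `searchHash` and the helper `buildFuzzyQuery` are placeholders, not names from the actual index mapping. Note that Elasticsearch's fuzzy query measures Levenshtein edit distance with a maximum fuzziness of 2. For equal-length terms the edit distance never exceeds the Hamming distance, so no hash within the requested Hamming distance is ruled out by the distance limit, but extra candidates can appear and may need re-ranking.

```typescript
/** Placeholder field name; the actual mapping may differ. */
const SEARCH_HASH_FIELD = 'searchHash';

/** Builds an Elasticsearch request body that fuzzy-matches a binary SEARCH_HASH term. */
function buildFuzzyQuery(searchHash: string, maxDistance: 0 | 1 | 2 = 2) {
  if (!/^[01]+$/.test(searchHash)) {
    throw new Error('searchHash must be a non-empty string of 0s and 1s');
  }
  return {
    query: {
      fuzzy: {
        [SEARCH_HASH_FIELD]: {
          value: searchHash,
          fuzziness: maxDistance,
          // A swap of adjacent characters flips two bits, so do not count it as one edit
          transpositions: false,
        },
      },
    },
  };
}
```

The fuzzy query's `max_expansions` setting (50 by default) caps how many terms are expanded, which may matter once many hashes are indexed.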
1s and 0s in 2 fields, Ibs and Obs
Suppose 3 subcodes (12 bits in length) are used, and we are finding 3-neighbors (which means 1-neighbors for each subcode).
FENSHES will generate a bool query with 3 terms queries; each terms query would have 13 candidate terms.
If we use fuzzy instead of terms queries, we can calculate up to 6-neighbors because we can perform a 2-neighbor search for each subcode. Not sure if this will include too many results, though.
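To make the count of 13 concrete: a 12-bit subcode has itself plus 12 single-bit flips as its 1-neighbors. The sketch below generates these candidate terms for each subcode of a binary SEARCH_HASH string. The function names are hypothetical, and how the terms are assembled into the bool query is left out.

```typescript
/** Splits a binary hash string into `parts` equally sized subcodes. */
function splitSubcodes(hash: string, parts: number): string[] {
  if (parts <= 0 || hash.length % parts !== 0) {
    throw new Error('hash length must be divisible by the number of subcodes');
  }
  const size = hash.length / parts;
  return Array.from({ length: parts }, (_, i) => hash.slice(i * size, (i + 1) * size));
}

/** Returns the subcode itself plus every code that differs from it in exactly one bit. */
function oneNeighbors(subcode: string): string[] {
  const flip = (bit: string) => (bit === '0' ? '1' : '0');
  const neighbors = [subcode];
  for (let i = 0; i < subcode.length; i++) {
    neighbors.push(subcode.slice(0, i) + flip(subcode[i]) + subcode.slice(i + 1));
  }
  return neighbors;
}

/** Candidate terms per subcode: 3 lists of 13 terms for a 36-bit hash. */
function candidateTerms(searchHash: string, parts = 3): string[][] {
  return splitSubcodes(searchHash, parts).map(oneNeighbors);
}
```

By the pigeonhole principle, two 36-bit hashes within Hamming distance 3 must have at least one of the 3 subcodes within distance 1, which is why 1-neighbor candidates per subcode are enough to retrieve all 3-neighbors.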
In the current design, we use the file content’s ID hash as the file name to perform deduplication. Therefore, the following data are subject to change if the hashing method is updated, but they are outside of MediaManager’s administration:
(articles.attachmentHash)
(articles.attachmentUrl)
If we plan to update the hashing algorithm one day in the future, a strategy for doing so may be:
MediaManager’s prefix along with the hashing algorithm update so that new media files are sent to the new prefix.
MediaManager again; however, as Cofacts relies on file ID as foreign key, it will not map to any Cofacts article yet
mediaentries index in this step
attachmentUrl / attachmentHash in Cofacts database according to the generated mapping in step 2.
In previous research, we planned to extract frames from video and use chromaprint for audio.
However, for the time being we do not have enough real samples from LINE to evaluate the effectiveness of the hashing / fingerprinting techniques.
Furthermore, we found that if the user saves audio and video from LINE and sends them to LINE again, the files the Messaging API provides are exactly identical.
MyGoPen said they also use file hashes. However, they observed that on some unusual phones, the file may change (by a few bytes) when its user downloads it and re-uploads it to LINE.
Therefore, we will just process these files in the same way as other files (see below).
For other file formats like PDF, doc, zip, the user is less likely to mutate them when sharing, and LINE does not mutate them either. We can just use file fingerprinting to dedup.
We can use the Node.js native crypto package to generate a SHA-256 hash of each file as its fingerprint. It supports streaming.
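A minimal sketch of streaming fingerprinting with the built-in crypto module is shown below. The function name `sha256Fingerprint` and the hex encoding of the digest are assumptions.

```typescript
import { createHash } from 'node:crypto';
import { Readable } from 'node:stream';

/** Consumes a readable stream chunk by chunk and resolves to its SHA-256 digest in hex. */
async function sha256Fingerprint(body: Readable): Promise<string> {
  const hash = createHash('sha256');
  for await (const chunk of body) {
    hash.update(chunk); // only the current chunk is held in memory
  }
  return hash.digest('hex');
}
```

The digest depends only on the byte sequence, not on how the stream happens to be chunked, so the same file always maps to the same ID.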
<prefix>/image/<SEARCH_HASH>/<ID_HASH>/original
<prefix>/image/<SEARCH_HASH>/<ID_HASH>/jpg
<prefix>/image/<SEARCH_HASH>/<ID_HASH>/webp
<prefix>/video/<ID_HASH>/original
<prefix>/video/<ID_HASH>/preview (240p & 16fps, usually played in 1.5x speed, 24kbps aac_he_v2 or aac vbr1 encoded)
<prefix>/video/<ID_HASH>/jpg of first frame, 240p
<prefix>/video/<ID_HASH>/webp of first frame, 240p
<prefix>/video/<ID_HASH>/highlight (240p & 24fps, no sound, 6 seconds long, 1.5x speed)
<prefix>/audio/<ID_HASH>/original
<prefix>/audio/<ID_HASH>/preview (aac vbr1, mono)
<prefix>/video|audio/<ID_HASH>/original.gz
gzip flag in createWriteStream
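The path layout above could be produced by a small helper like the following. This is a sketch only: `objectPath` and its parameter names are hypothetical, and the `file` layout (`<prefix>/file/<ID_HASH>/original`) is an assumption, since the list only spells out images, videos and audio.

```typescript
type MediaKind = 'image' | 'video' | 'audio' | 'file';
type Variant = 'original' | 'jpg' | 'webp' | 'preview' | 'highlight';

interface PathParams {
  prefix: string;
  type: MediaKind;
  idHash: string;
  /** Only used for images, which are grouped by SEARCH_HASH */
  searchHash?: string;
  variant: Variant;
}

/** Builds the object name for one variant of a media file. */
function objectPath({ prefix, type, idHash, searchHash, variant }: PathParams): string {
  // Tolerate a trailing slash in the configured prefix
  const segments = [prefix.replace(/\/+$/, ''), type];
  if (type === 'image') {
    if (!searchHash) throw new Error('Images need a SEARCH_HASH path segment');
    segments.push(searchHash);
  }
  segments.push(idHash, variant);
  // An empty prefix simply drops out of the path
  return segments.filter(Boolean).join('/');
}
```

Keeping SEARCH_HASH ahead of ID_HASH in image paths is what allows listing by the `<prefix>/image/<SEARCH_HASH>` prefix.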
Add getSignedUrl(fileType, expires) to FileInfo
FileInfo.url still returns the public URL to the original file.
fileType:
'original' | 'jpg' | 'webp' for images
webp & jpg are for preview in chatbot & website, max 512px width or height
'original' | 'preview' | 'jpg' | 'webp' | 'highlight' for videos
'original' | 'preview' for audio
'original' only for files
expires:
rumors-api integration
attachmentUrl field does not read from Elasticsearch; it calls getSignedUrl instead
attachmentUrl field will take an optional fileType argument
webp is chosen for images; preview is chosen for videos and audio; original is chosen for other files
original
For getSignedUrl(), we can choose an expiry date that is fixed for 24 hours so that the URL can be cached by the browser
Set expires to 23:59 of the next day. Every request to the image today will generate the same URL using the same expires.
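A fixed expiry of this kind can be computed as in the sketch below. The function name `fixedExpiry` is hypothetical, and using UTC as the day boundary is an assumption; the actual timezone choice is not specified here.

```typescript
/**
 * Returns 23:59 (UTC) of the day after `now`.
 * Every call made on the same UTC day yields the same Date.
 */
function fixedExpiry(now: Date = new Date()): Date {
  // Date.UTC normalizes day overflow, e.g. January 32 becomes February 1
  return new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1, 23, 59)
  );
}
```

With this choice a URL generated late in the day is still valid for about 24 hours, while one generated early in the day lasts for almost 48 hours.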
FileInfo
Some info that may help with rendering, and features for editors to quickly recognize if two media are identical.
Separation of concerns between Media Manager and Cofacts API
FileInfo) and has no idea about search hash and search logic
sequenceDiagram
participant User
participant Cofacts API
participant Elasticsearch
participant Media Manager
participant GCS
User->>+Cofacts API: ListArticles(mediaUrl)
Cofacts API->>+Media Manager: query({url: mediaUrl})
Note over Media Manager: fetch header
opt big image file
Note over Media Manager: resize
end
Note over Media Manager: generate hashes
Media Manager->>+GCS: bucket.getFiles({prefix})
GCS->>-Media Manager: files with prefix
Note over Media Manager: sort with hash dist
Media Manager->>Cofacts API: Search hits with ID_HASH
deactivate Media Manager
Cofacts API->>+Elasticsearch: terms query by ID_HASH[]
Elasticsearch->>-Cofacts API: search hits for articles
Cofacts API->>User: Search result
deactivate Cofacts API
sequenceDiagram
participant User
participant Cofacts API
participant Elasticsearch
participant Media Manager
participant GCS
User->>+Cofacts API: CreateMediaArticle(mediaUrl)
Cofacts API->>+Media Manager: insert({url: mediaUrl})
Note over Media Manager: fetch header
Media Manager->>+GCS: Pipe to file under temp name
alt is not image
Note over Media Manager: SHA256 ID_HASH
else is image
opt big image file
Note over Media Manager: resize
end
Note over Media Manager: 36b SEARCH_HASH
Note over Media Manager: 256b ID_HASH
end
Note over Media Manager, GCS: check if file name exists on GCS
alt does not exist
Note over Media Manager, GCS: Rename temp name with hash
else exists
Note over Media Manager, GCS: delete file in current temp name
end
Media Manager->> Cofacts API: File info
deactivate Media Manager
GCS->>Media Manager: Upload complete
deactivate GCS
Note over Media Manager: call onUploadStop()
Cofacts API->>+Elasticsearch: term query by ID_HASH
Elasticsearch->>-Cofacts API: search hits for articles
opt article not exist
Cofacts API->>+Elasticsearch: Index new article with hash
Elasticsearch->>-Cofacts API: new doc ID
end
Note over Cofacts API, Elasticsearch: createOrUpdateReplyRequest()
Cofacts API->>User: Article ID
deactivate Cofacts API