Eric Kim’s Philosophy: Simplicity, Emotion & Storytelling
Eric Kim – a noted street photographer and visual philosopher – preaches that a powerful photograph must convey a clear message and emotion. In his view, “a picture without a message is useless,” and composition is the tool to direct the viewer’s eye to what’s important. Rather than generic “good” or “bad” composition, Kim emphasizes making images dynamic, edgy, and human – avoiding static or overly symmetrical frames. Key tenets of his style include:
- Simplicity: Clean, uncluttered backgrounds and simple forms, which heighten the impact of the subject. He often notes “simplicity is the ultimate sophistication” – a simple backdrop with one clear subject (strong figure-to-ground separation) makes any dynamic element stand out.
- Dynamic Tension: Compositions should feel energetic and “off-kilter.” Kim encourages asymmetry and diagonal lines instead of static symmetry or boring centering. Techniques like shooting from low or high angles, using wide lenses, and even tilting the camera (Dutch angle) create a sense of motion and tension in the frame.
- Human-Centric Emotion: Above all, Kim’s photos prioritize human stories and emotions. He looks for expressive subjects – candid laughter, intense eyes, or poignant body language – capturing “dynamic emotions” that give the image soul. The goal is to evoke feeling in the viewer, making the photo a narrative moment, not just a pretty picture. In short, composition and technique serve the story – the human experience caught in that decisive moment.
Any AI system aiming to select “the best” photos in Kim’s style must therefore go beyond basic sharpness or rule-of-thirds. It should gauge whether an image is clean yet powerful in composition, emotionally resonant, and tells a human story. This is a subjective, nuanced task, far beyond mere object recognition or technical quality checks.
AI Models for Aesthetic & Composition Analysis
Research in computational aesthetics provides a starting point. Modern image assessment models use deep learning to predict how humans judge photo quality. One prominent example is Google’s Neural Image Assessment (NIMA), a CNN trained on large datasets of user ratings. NIMA predicts a score distribution for an image, which can be reduced to a mean aesthetic score used to rank photos by beauty and impact. Notably, NIMA’s rankings on the AVA photo dataset correlate closely with average human opinions, showing that AI can learn to approximate general aesthetic preferences. Such models consider factors like color, lighting, composition balance, etc., and have been shown to score images with high reliability relative to human perception.
Another relevant line of work is composition quality assessment. The CADB (Composition Assessment Database) introduced a dataset and model specifically for scoring photographic composition. It labels images with composition ratings (1–5) and even identifies classical composition patterns (e.g. rule of thirds, symmetry, leading lines). The accompanying model, Saliency-Augmented Multi-pattern Pooling (SAMP-Net), was designed to evaluate composition quality and outperformed generic aesthetic scorers on this task. This indicates that domain-specific models can effectively judge how well an image’s layout and framing work. Such a model could discern if a photo has, say, a strong diagonal structure or an off-center subject creating tension – features closely aligned with Kim’s dynamic composition ideals.
Beyond static composition, assessing emotion and storytelling impact remains challenging. Some AI approaches try to analyze images for emotional content by detecting faces and expressions or “sentiment” of the scene. For example, face emotion recognition algorithms can estimate if subjects are happy, sad, surprised, etc., which could serve as a proxy for emotional impact in a photo. There’s also emerging research leveraging multimodal models like OpenAI’s CLIP to evaluate aesthetics and semantics together. Because CLIP was trained on image-text pairs, it encodes not just objects but also style and even “subjective feelings about the image” described in text. In fact, a 2022 study found CLIP’s learned features to be well-suited for Image Aesthetic Assessment, since CLIP’s language supervision forces it to notice composition, lighting, and mood in images. Researchers have successfully “prompted” CLIP to act as an aesthetic judge by comparing an image’s similarity to positive prompts (like “an outstanding picture”) versus negative prompts (like “an atrocious picture”). This zero-shot prompting approach effectively lets CLIP assign an aesthetic score without any fine-tuning, and more complex prompts can incorporate content (e.g. “a beautiful portrait of a person”) to be context-aware. Such techniques could potentially be extended to prompts about emotional storytelling (for instance, comparing “a photo with powerful emotion and story” vs “a bland, emotionless photo”).
It’s worth noting that no single off-the-shelf model today fully captures “storytelling impact.” AI can detect if a photo is sharp, well-exposed, and even aesthetically composed, but understanding narrative value is an unsolved problem. As one analysis bluntly stated, “human emotion captured in time can never truly be replicated with an algorithm.” AI still fundamentally “sees” pixels and patterns, whereas a human photographer perceives nuance, context, and meaning beyond the frame. Nonetheless, by combining multiple AI evaluations – aesthetics, composition, and content analysis – we can approximate a composite judgment of which images might resonate most with viewers.
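To make the prompting idea concrete, here is a minimal sketch of zero-shot CLIP aesthetic scoring using the Hugging Face transformers library. The checkpoint name and the single positive/negative prompt pair are illustrative assumptions, not the exact setup from the cited study.

```python
# Minimal sketch of zero-shot aesthetic scoring with CLIP via Hugging Face
# transformers. The prompts below follow the positive/negative pattern
# described above; they are illustrative, not the paper's exact prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_aesthetic_score(image_path: str) -> float:
    """Return a 0-1 score: probability mass on the 'positive' prompt."""
    image = Image.open(image_path).convert("RGB")
    prompts = ["an outstanding picture", "an atrocious picture"]
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the image's similarity to each text prompt
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return probs[0].item()  # closeness to the positive prompt
```

Running this over a folder of images and sorting by the returned value already yields a rough aesthetic ranking, and the same pattern could be reused with story-oriented prompt pairs.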
Existing Tools & APIs for Subjective Image Analysis
There are already several tools that attempt to quantify image quality or aesthetics, which could be repurposed or serve as components in this project:
- Aesthetic Scoring APIs: Services like Everypixel’s UGC Photo Scoring API will return an aesthetic quality score (0–100%) for a given image. Everypixel’s model was trained on hundreds of thousands of photographs and outputs a score reflecting perceived quality and attractiveness. It combines technical factors (sharpness, exposure) and some aesthetic factors, and categorizes images from “very bad” (blurry, poorly composed) to “excellent” (professional-quality) based on the score range. However, notably “the algorithm doesn’t estimate the [image’s] plot” or the intrinsic coolness of the subject – it focuses on aesthetics and technical quality, not storytelling. This means such APIs can flag well-composed, pretty images but might miss emotional depth or narrative context.
- Composition Analysis Tools: Some AI services specifically target composition. For instance, Nyckel provides a pretrained classifier that simply judges “Well Composed” vs “Poorly Composed” for an input photo. Under the hood, it was trained on a custom dataset labeled by composition quality, and it returns a confidence score indicating how strongly it believes an image has good composition. Nyckel even offers related classifiers like “good subject isolation” or “proper framing” – essentially checking if the main subject stands out and isn’t awkwardly cropped. These could be useful signals: a higher composition score suggests the photo follows proven framing techniques (which likely aligns with simplicity and clarity in the shot). One could use such an API to filter out obviously poor compositions or to rank images by compositional strength.
- Open-Source Models and Libraries: The research community has open implementations of many aesthetic models. For example, the IDEALO Aesthetic Score project provides a ready-to-use model based on NIMA (available in PyTorch/TensorFlow). With it, a developer can compute aesthetic scores locally for a batch of images. Similarly, the authors of “CLIP knows Image Aesthetics” released their code for prompting CLIP and even a fine-tuned CLIP model for aesthetic scoring. One can leverage such models via libraries (HuggingFace’s Transformers has CLIP, etc.) to avoid training from scratch. For facial emotion recognition, there are also libraries like OpenCV (with pre-trained DNN classifiers for faces/expressions) or deepface in Python that can quantify expressions. These tools can be combined – e.g., use NIMA to get a base aesthetic score, then adjust or annotate images with an emotion score (number of smiling faces, etc.), depending on what’s in the photo.
- Commercial Photo Culling Software: In the photography world, AI-powered culling tools have emerged to help select the best shots from a shoot. Apps like Narrative Select and Aftershoot use AI to pick the sharpest image in a series, detect if subjects’ eyes are open, and even gauge facial expressions to find the most flattering shots. While these are more targeted to event photographers (e.g. weeding out photos where someone blinked), it shows the trend of AI making subjective choices. Another product, Peakto, explicitly offers “AI aesthetic analysis” to sort your library. Peakto computes a global “Aesthetic Score” that combines aesthetics, composition, and technical quality – allowing the user to “instantly identify your most striking photos.” This illustrates a feasible approach: a weighted combination of multiple factors into one score (for Peakto, likely using an aesthetic model plus checks on composition rules and technical metrics like focus or exposure).
Example from Peakto’s AI analysis: A photo is evaluated for aesthetic impact – note the “Global Score” combining aesthetic and technical scores. Our tool could similarly assign composite scores to rank images.
In summary, there are building blocks available. Aesthetic scoring and composition classifiers can be accessed via APIs or open models. They handle the “easy” part of the problem (technical and basic composition quality). The harder part – evaluating emotion and story – might require custom logic or fine-tuning, as current tools don’t explicitly output “emotion scores.” We might use proxies (like face expression analysis or CLIP with text prompts about mood) to approximate this.
Designing an Eric Kim-Inspired Scoring System
To incorporate Eric Kim’s philosophical lens into the AI, we need to tweak what we measure and how we score: the system should favor images that are simple yet dynamic, and which have human-centered emotional appeal. A feasible approach is to design a composite scoring function that blends several model outputs, each targeting one aspect of Kim’s criteria:
- Base Aesthetic/Quality Score: Use a pre-trained aesthetic model (like NIMA or Everypixel’s API) to get an initial score reflecting overall appeal. This will already account for general composition aesthetics, color harmony, etc. For instance, NIMA’s output (mean score 1–10) or Everypixel’s 0–100% can serve as a base rating of how “attractive” the photo is. High scores likely correlate with good use of light, pleasing composition, and lack of technical flaws – prerequisites for a “keeper” image.
- Composition Strength: Incorporate a composition-specific metric. This could be as simple as the Nyckel composition classifier’s confidence that the photo is well-composed, or a score from a model like SAMP-Net (if one fine-tunes or runs it). Additionally, heuristic checks could be used: e.g., does the subject align roughly with rule-of-thirds or golden ratio points? Is the horizon straight? How balanced is the framing? These can be approximated by analyzing the distribution of edges and objects in the frame (OpenCV can find dominant lines, etc.). The idea is to boost images that “follow a strong compositional pattern” and penalize those that feel awkward or cluttered. Since Kim values asymmetry and diagonals, we might not strictly reward rule-of-thirds, but we do want intentional framing. A photo that scores high here would have a clear structure (leading lines, frames within frame, etc.) that guides the eye – aligning with Kim’s notion of guiding the viewer with composition.
- Simplicity (Clutter Penalty): We want to prefer images with clean backgrounds and singular focus. We can attempt to quantify clutter or simplicity. One approach: detect how many distinct objects or faces are in the scene – fewer can imply simplicity. Another: measure background busyness (perhaps via edge detection or an entropy measure on the background regions). If an image has a shallow depth-of-field (blurred background) or high contrast between subject and background, that indicates strong figure-ground separation (which Kim praises). We could also use the “subject isolation” classifier from Nyckel for this purpose – it would tell us if the subject stands out well. A high subject isolation score or low clutter metric would nudge the photo up in the rankings, echoing the “keep it simple” mantra.
- Human Presence & Emotion: Because we’re focusing on storytelling and humanity, images containing people – especially showing expressive emotions or interactions – should rank higher. We can integrate a face and emotion detector to address this. For example, run a face detection; if faces are found, check their facial expressions (happy, sad, surprised, etc.). An image of a person smiling, crying, or otherwise emoting could get an “emotion bonus.” We could even gauge body language: a person mid-action or gesture might be considered more dynamic than a static pose (some pose estimation models could classify energetic vs static posture). If an image has no people, it doesn’t automatically mean it’s bad (a silhouette or a scene can still tell a story), but often Kim’s most compelling shots involve human subjects. So this criterion ensures human-centric shots aren’t undervalued by a generic aesthetic model that might otherwise favor a pretty landscape. It aligns with Kim’s advice to “look for emotions in faces, body language, or eyes” when shooting.
- Dynamic Elements: To capture “dynamic tension,” we can look at factors like tilt and motion. Does the image have diagonal lines or is the subject off-center in a striking way? We might compute the angle of lines in the image (via a Hough transform for example) – an abundance of diagonals or a tilted horizon could be a proxy for dynamism (a rough OpenCV sketch of this heuristic, together with the clutter check above, appears after this list). Likewise, if a person is captured mid-motion (running, jumping) or if there’s motion blur suggesting movement, that adds energy. Identifying these automatically is tricky, but one could combine clues (e.g. if shutter speed was slow and there’s blur, or if multiple limb positions in a burst suggest movement). A simpler approximation: use CLIP with prompts like “action shot” or “dynamic scene” versus “static scene” to see if CLIP leans one way. In practice, this component might be the most experimental; however, since the aesthetic and composition scores already favor well-structured images, this could be a smaller weighting that gives a slight edge to images that “feel alive.”
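As referenced in the list above, a rough sketch of two of these heuristics – edge density as a clutter/simplicity proxy and the share of diagonal Hough lines as a dynamism proxy – might look like the following. The thresholds and rescaling here are arbitrary starting points, not validated values.

```python
# Rough OpenCV sketch of a simplicity (clutter) proxy and a dynamism proxy.
# All numeric thresholds below are arbitrary starting points to be tuned.
import cv2
import numpy as np

def simplicity_and_dynamism(image_path: str) -> tuple[float, float]:
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)

    # Simplicity: fewer edge pixels suggest a cleaner, less cluttered frame
    edge_density = edges.mean() / 255.0              # 0 (blank) .. 1 (very busy)
    simplicity = 1.0 - min(edge_density * 5.0, 1.0)  # crude rescaling to 0-1

    # Dynamism: fraction of detected lines that are clearly diagonal
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=gray.shape[1] // 10, maxLineGap=10)
    dynamism = 0.0
    if lines is not None:
        angles = [abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
                  for x1, y1, x2, y2 in lines[:, 0]]
        diagonal = [a for a in angles if 20 < a < 70 or 110 < a < 160]
        dynamism = len(diagonal) / len(angles)
    return simplicity, dynamism
```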
All these pieces would be combined into a final score or ranking algorithm. For example, one might compute:
$$\text{KimScore} = w_a \cdot \text{AestheticScore} + w_c \cdot \text{CompositionScore} + w_s \cdot \text{SimplicityScore} + w_h \cdot \text{HumanEmotionScore} + w_d \cdot \text{DynamicScore},$$
with weights $w_a, w_c, \dots$ tuned by experimentation. The highest-scoring images according to this KimScore would presumably be those that are aesthetically pleasing, powerfully composed, and emotionally engaging – effectively “the best photos” by Eric Kim’s criteria.
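A minimal sketch of that aggregation in Python, assuming each sub-score has already been normalized to a 0–1 range; the weight values are illustrative starting points to be tuned against a hand-labelled set of “keepers” versus “discards”:

```python
# Weighted KimScore over normalized sub-scores. Weights are illustrative
# starting values, not calibrated constants.
WEIGHTS = {
    "aesthetic": 0.30,
    "composition": 0.25,
    "simplicity": 0.15,
    "human_emotion": 0.20,
    "dynamic": 0.10,
}

def kim_score(scores: dict[str, float]) -> float:
    """Weighted sum of sub-scores, each expected in [0, 1]."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

# Example: strong emotion and simplicity, middling technical aesthetics
print(kim_score({"aesthetic": 0.6, "composition": 0.7, "simplicity": 0.8,
                 "human_emotion": 0.9, "dynamic": 0.5}))
```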
Crucially, building this involves iteration and validation. One would likely test the tool on a set of photos (perhaps a mix of the user’s own images or known great vs mediocre shots) and see if the top picks indeed align with human preference. It may be necessary to adjust the weighting or even incorporate a small amount of training – e.g., if the user can label a subset of photos as “favorites” vs “discard,” one could fine-tune a model or calibrate the scoring function to better match the user’s taste (which hopefully mirrors the Eric Kim style if that’s the aim).
Implementation Strategy and Practical Steps
Choosing the Platform: Given the need to handle up to 1,000 images efficiently, a local application or a client-side tool is a sensible choice. Uploading thousands of JPEGs (potentially several GB of data) to a server can be slow and raises privacy concerns – especially if these are personal photos. A desktop application (or a command-line script) can leverage the user’s hardware (CPU/GPU) to run the analysis without needing to send images over the internet. Alternatively, a lightweight web app could be created where the heavy lifting happens server-side (using cloud compute), but this requires a robust back-end and may incur processing costs (especially if using paid APIs or a GPU server). For a solo photographer’s workflow, a local tool is likely more practical. A Python stack (TensorFlow/PyTorch for models, OpenCV for image processing, etc.) is well-suited, and one can create a simple GUI or use a notebook/command line. For a friendlier interface without much coding, one could build a small Streamlit web app that runs locally – it would allow drag-and-drop of images and display results in the browser, but all computation still happens on the user’s machine.
Key Steps to Build the System:
- Set Up the Environment: Install the necessary libraries and models. This includes a deep learning framework (e.g. PyTorch) and loading pre-trained model weights for aesthetic scoring. For example, you might install a PyTorch implementation of NIMA or use a model checkpoint from the AVA dataset. Also install OpenCV (for image operations), a face/emotion detector (e.g. mediapipe or deepface), and optionally the transformers library for CLIP if using it.
- Image Ingestion: Implement a way for the user to input up to 1000 images. This could simply be pointing the tool at a folder of JPEGs or using a file uploader component (in a GUI/Streamlit). The tool should batch-process images efficiently – possibly resizing them first for faster processing (many models don’t need full resolution; e.g., NIMA was trained on images resized to a certain size). Keep an eye on memory if loading many images at once; processing sequentially or in mini-batches is safer.
- Aesthetic Scoring: Run each image through the aesthetic model to get a base score. If using NIMA, you’ll get a distribution or mean score per image. If using an API like Everypixel, send each image (or its URL) and retrieve the score (noting that API usage for 1000 images might require an API key and has cost limits – Everypixel offers 1000 images for $0.6 in their pricing, which is actually quite affordable for occasional use). Be sure to handle the API responses or model outputs carefully and store the scores (a batch-processing sketch covering this and the ingestion step follows this list).
- Composition & Simplicity Analysis: For each image, apply composition checks. If you have a classifier (like Nyckel’s), you might call it via API or, if you trained one, run the classifier locally. Alternatively, extract features: e.g., use OpenCV to detect edges and lines (Hough transform) to see if strong diagonals exist, or compute the rule-of-thirds alignment of the main subject. The main subject could be found by either using an object detection model or assuming the largest object/face is the subject. Also calculate a simplicity metric: perhaps count detected faces/people or objects – images with one person might be scored higher than those with ten people fighting for attention (unless the story needs multiple subjects, but that’s an edge case). If using face detection for emotion, also note whether a face is found and what expression it shows (see the emotion-detection sketch after this list). For example, if an image has no people at all, the “human-centric” component might default to 0, whereas an image with an identifiable emotional expression might get a positive boost.
- CLIP-based Semantic Scoring (Optional): If incorporating CLIP, encode each image with CLIP’s image encoder, and also encode one or more descriptive prompts that represent Eric Kim’s ideals. For instance, you could use two prompts like “a powerful, emotional street photograph” and “a dull ordinary snapshot” and measure cosine similarity of the image embedding to each. The difference (image–positive minus image–negative similarity) could yield a semantic score reflecting how close the image is to the “ideal”. Researchers found such fixed prompting can effectively distinguish aesthetic quality. You might craft multiple prompts (covering simplicity, emotion, etc.) and ensemble them as per the CLIP aesthetics paper’s method. This step is more experimental but could directly incorporate the “philosophical” aspect in natural language terms.
- Score Aggregation: Once all metrics are computed for an image (aesthetic score, composition score, etc.), combine them into a final score as discussed. This might involve normalizing each sub-score to a common scale (since different models output different ranges – e.g., NIMA might be 0–10, composition classifier 0/1, face count an integer, etc.). One simple approach is to convert everything to 0–1 range and then apply weights. You might start with equal weights and then adjust: e.g., if you find the AI is picking technically perfect but emotionless images, increase the weight on the “human emotion” factor. This can be done through trial and error or even a small training approach if ground truth rankings are available.
- Selecting Top Photos: With a final KimScore for each image, sort the images in descending order. The top N images (perhaps top 10 or 50, depending on how many “best” the user wants) are the selections. Provide these to the user – e.g., display thumbnails of the winners, maybe even with a short explanation or score breakdown. A neat feature for transparency might be showing why a certain photo scored high (e.g., “Score 0.85 – High aesthetic appeal, good composition, emotional facial expression detected”). This helps the user trust the AI’s picks and also learn which of their images have the qualities in question.
- Iteration and Tuning: It’s unlikely the first version gets everything right. The user (or evaluator) should review the chosen images to see if they indeed have the desired impact. If some duds slipped through, analyze why. Perhaps an image was sharply composed but had no real story – you might then decide to weight the emotion factor more. Or if an image with a lot of grain but great content was ranked low due to the aesthetic model disliking noise, you might decide to down-weight technical quality in favor of content. Since Eric Kim’s philosophy would value a gritty but emotionally charged street photo over a sterile but perfectly exposed shot, your weighting should reflect that priority. It’s an iterative design process to encode subjective preferences into the model.
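As referenced in the ingestion and scoring steps above, a simple batch-processing loop might look like the sketch below. The score_image argument is a placeholder for whichever scorer was loaded (NIMA, CLIP prompting, or an API wrapper); the resize limit is an arbitrary choice.

```python
# Batch-processing sketch for the "Image Ingestion" and "Aesthetic Scoring"
# steps. `score_image` is a placeholder callable supplied by the caller.
from pathlib import Path
from PIL import Image, ImageOps

def load_resized(path: Path, max_side: int = 1024) -> Image.Image:
    """Load a JPEG, apply EXIF rotation, and downscale for faster inference."""
    img = Image.open(path).convert("RGB")
    img = ImageOps.exif_transpose(img)   # honor EXIF orientation
    img.thumbnail((max_side, max_side))  # in-place, keeps aspect ratio
    return img

def score_folder(folder: str, score_image) -> list[tuple[Path, float]]:
    """Score every *.jpg in a folder sequentially; skip unreadable files."""
    results = []
    for path in sorted(Path(folder).glob("*.jpg")):
        try:
            results.append((path, score_image(load_resized(path))))
        except Exception as exc:
            print(f"skipping {path}: {exc}")
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Processing images sequentially like this keeps memory usage flat even for 1,000 files; batching on a GPU would be a straightforward extension.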
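And for the face/emotion component, a hedged sketch using the deepface library; the mapping from dominant emotion to a bonus value is an arbitrary assumption, and the return format of DeepFace.analyze varies slightly between library versions.

```python
# Face/emotion bonus sketch using deepface. The bonus values are arbitrary
# assumptions to be tuned; older deepface versions return a dict, newer a list.
from deepface import DeepFace

STRONG_EMOTIONS = {"happy", "sad", "surprise", "angry", "fear"}

def emotion_bonus(image_path: str) -> float:
    """Return 0-1: higher when at least one face shows a strong expression."""
    try:
        faces = DeepFace.analyze(img_path=image_path, actions=["emotion"],
                                 enforce_detection=False)
    except Exception:
        return 0.0
    if isinstance(faces, dict):   # normalize older single-dict return format
        faces = [faces]
    bonus = 0.0
    for face in faces:
        emotion = face.get("dominant_emotion")
        if emotion in STRONG_EMOTIONS:
            bonus = max(bonus, 1.0)
        elif emotion:             # a face was found, but a neutral expression
            bonus = max(bonus, 0.4)
    return bonus
```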
Feasibility and Considerations: This system is ambitious because it attempts to quantify the unquantifiable (story and emotion), but it is feasible to implement a working version with today’s AI tools. The core image scoring (aesthetic/composition) can run reasonably fast – models like NIMA or CLIP can process an image in fractions of a second on a modern GPU. With 1000 images, you might process them in batches and get results in a matter of minutes (a bit longer on CPU). Using local models avoids API call overhead; however, if using external APIs, be mindful of rate limits (you may need to throttle requests). Also, ensure the app can handle different image sizes and orientations (maybe auto-rotate images with EXIF data so that composition analysis isn’t fooled by sideways images).
Another practical aspect is the interface: if building a desktop GUI, frameworks like Qt or Tkinter could be used, but a simpler route is a web UI. A Streamlit app, for example, can provide file upload, display the scored images, and run entirely locally (a minimal sketch follows below). If a web app with a server is preferred (perhaps to allow using a more powerful cloud GPU), one must build a backend (maybe a Flask or FastAPI service that wraps the models) and a front-end for uploads. For a one-off tool or personal use, this overhead might not be necessary.
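A bare-bones local Streamlit front end might look like this; kim_score_image is a hypothetical stub standing in for the scoring pipeline described earlier. Saved as app.py, it would be launched with streamlit run app.py.

```python
# Minimal local Streamlit UI sketch. kim_score_image is a placeholder stub;
# in practice it would wrap the composite KimScore pipeline.
import streamlit as st
from PIL import Image

def kim_score_image(img: Image.Image) -> float:
    return 0.5  # stub: replace with the real composite scoring function

st.title("Eric Kim-style photo picker")
uploads = st.file_uploader("Drop JPEGs here", type=["jpg", "jpeg"],
                           accept_multiple_files=True)
if uploads:
    scored = []
    for f in uploads:
        img = Image.open(f).convert("RGB")
        scored.append((kim_score_image(img), f.name, img))
    scored.sort(key=lambda t: t[0], reverse=True)
    for score, name, img in scored[:20]:   # show the top 20 picks
        st.image(img, caption=f"{name} - KimScore {score:.2f}")
```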
Finally, remember that AI selection should assist but not fully replace human judgment. It would be wise to let the user review the top picks rather than deleting all others. The AI can shortlist, say, 50 out of 1000, which the photographer can then manually go through. This way, the “Eric Kim AI” acts like an intelligent advisor, implementing Eric’s principles at scale, but the human can make the final call – combining computational speed with human intuition, which is very much in line with using AI as a creative assistant rather than a definitive judge.
Conclusion
Building an AI photo curation tool inspired by Eric Kim’s ethos involves blending state-of-the-art image assessment models with custom criteria reflecting simplicity, dynamism, and human feeling. While current AI can reliably judge technical and basic aesthetic quality, imbuing it with a “philosophical lens” requires thoughtful engineering and possibly some AI fine-tuning. By leveraging existing models (for aesthetics and composition) and augmenting them with modules for detecting emotional and storytelling cues, one can create a composite scoring system to automatically sift through hundreds of images and highlight those with the strongest impact. The implementation is certainly non-trivial – it spans computer vision, maybe a bit of natural language processing (if using CLIP), and software design to handle bulk image input – but it is achievable with today’s open-source tools and APIs. The end result would be a powerful assistant that helps photographers identify their most compelling shots, echoing Eric Kim’s focus on powerful simplicity and human stories, at a scale and speed impossible to do manually. Such a tool would not only save time but could also serve as a learning feedback mechanism: by seeing which photos the AI (trained on human aesthetic principles) selects, photographers might gain insights into compositional and emotional elements that make an image stand out. It’s an exciting convergence of art and AI – using machine learning to amplify a human creative vision.
Sources: The approach and recommendations above draw upon research from Google on neural image assessment, academic studies on composition analysis, as well as insights from Eric Kim’s own writings on dynamic composition and emotion in photography. Existing AI tools like Everypixel’s scoring API and Nyckel’s composition classifier demonstrate the capabilities and limitations of current image aesthetics technology. Products like Peakto illustrate how combined aesthetic and technical evaluation can identify top images in a library. Ultimately, the system proposed is grounded in these state-of-the-art techniques, while pushing into more subjective territory – an area where ongoing innovations (like CLIP-based aesthetic reasoning) are rapidly expanding what’s possible for AI-driven image curation.