Egregoros · Phoenix Framework

Andrew Zonenberg

@azonenberg@ioc.exchange

Security and open source at the hardware/software interface. Embedded sec @ IOActive. Lead dev of ngscopeclient/libscopehal. GHz probe designer. Open source networking hardware. "So others may live"

Toots searchable on tootfinder.

Posts

3303

Followers

475

Latest notes

Andrew Zonenberg

@azonenberg@ioc.exchange remote

I really don't need a new project, but I would love a clustered and/or GPU accelerated image editor for massive images.

Something that can comfortably and performantly rotate, crop, layer stack, align, etc. images at gigapixel scale without bogging down for minutes at a time even on my big iron.

Across my lab I can harness something like 160 physical / 320 logical CPU cores, 1.3 TB of RAM, 50+ GB of VRAM, and more shader cores than I feel like counting.

Everything is or will soon (after I swap a few NICs) be connected by 40/100GbE.

Deleting part of a layer in a 0.8 gigapixel image with five layers, with that much compute available, doesn't seem like it should be the kind of thing I have to sit and watch GIMP updating scanline by scanline.

Andrew Zonenberg

@azonenberg@ioc.exchange remote

Like even drawing a selection on parts of this image is making GIMP slow. It's just a list of polygon vertices for the outlined area, this should be *trivial*. Especially since dragging to make a selection can't have a resolution greater than the currently displayed pixel size.

Andrew Zonenberg

@azonenberg@ioc.exchange remote

@petrillic All of my Windows boxen are VMs and my one mac is a Mini with 16GB of RAM that probably can't even open a file this big much less edit it.

The other thing is, it needs to not assume it can fit the entire image into VRAM. Like, this particular image and workstation combo would probably work (it's ~800mpix, 5 layers, so ~4 Gpix, at RGBA32 that's 16GB and my new GPU has 32GB of VRAM) but anything much bigger would not.
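The footprint arithmetic in that post can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the function names are hypothetical, "RGBA32" is assumed to mean 4 bytes per pixel, and the 32 GB of VRAM is assumed to be 32 GiB.

```python
# Rough memory footprint of a layered image, using the figures quoted in the post.
# Assumption: "RGBA32" means 32 bits (4 bytes) per pixel.

def layered_image_bytes(pixels_per_layer: int, layers: int, bytes_per_pixel: int = 4) -> int:
    """Uncompressed size of all layers, ignoring masks, undo history, and tiling overhead."""
    return pixels_per_layer * layers * bytes_per_pixel


def fits_in_vram(total_bytes: int, vram_bytes: int) -> bool:
    """True if the whole image could (in principle) be resident on the GPU at once."""
    return total_bytes <= vram_bytes


footprint = layered_image_bytes(800_000_000, 5)   # ~800 Mpix, 5 layers
vram = 32 * 2**30                                 # assumed: 32 GiB card

print(footprint / 1e9, "GB")          # 16.0 GB
print(fits_in_vram(footprint, vram))  # True, but with little room for scratch buffers
```

Doubling either the pixel count or the layer count already takes the total to 32 GB, which is why a tool like this cannot assume the whole image stays resident on the GPU.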

So intelligently managing transfers between CPU and GPU to do the transformations will be part of the requirements for such a tool, to avoid getting horribly bottlenecked on PCIe bandwidth.
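One common way to avoid holding the whole image in VRAM is to stream it through the GPU in tiles. The sketch below only does the bookkeeping (no actual GPU calls), and everything in it is an assumption for illustration: the function names, the idea of a fixed VRAM budget, and the choice of two buffers per tile so that one upload can overlap with compute on the previous tile.

```python
import math
from typing import Iterator, Tuple

Tile = Tuple[int, int, int, int]  # (x, y, width, height)


def max_square_tile(budget_bytes: int, layers: int,
                    bytes_per_pixel: int = 4, buffers: int = 2) -> int:
    """Largest square tile edge such that `buffers` copies of one tile,
    across all layers, fit in the given VRAM budget."""
    bytes_per_tile_pixel = layers * bytes_per_pixel * buffers
    return math.isqrt(budget_bytes // bytes_per_tile_pixel)


def tile_grid(width: int, height: int, tile: int) -> Iterator[Tile]:
    """Yield tiles covering the image row by row; edge tiles are clipped."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y, min(tile, width - x), min(tile, height - y)


# Hypothetical example: 1 GiB of VRAM set aside for tiles of a 5-layer image.
edge = max_square_tile(2**30, layers=5)
tiles = list(tile_grid(40_000, 20_000, edge))
print(edge, len(tiles))
```

A real editor would also have to handle operations such as rotation, where an output tile depends on several input tiles, so this is a starting point rather than a design.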

Andrew Zonenberg

@azonenberg@ioc.exchange remote

Weird things that happen when you have a 100GbE pipe to your desk. 8 Gbps of sustained network traffic and the network monitor is like "yeah you're not using much bandwidth"

Also I think there's a 32-bit overflow or something in xfce4-netload-plugin because the rate shows 0.00 Mbps when I get above some threshold (not sure what it is exactly but it's in the 15-40 Gbps range)
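As a rough sanity check of the 32-bit hypothesis, the snippet below works out which line rates correspond to 32-bit limits. It is illustrative only: the plugin source has not been consulted here, and the idea that a bytes-per-second value is held in a 32-bit integer is purely an assumption.

```python
UINT32_MASK = 2**32 - 1


def wrap_u32(value: int) -> int:
    """Truncate a value the way an unsigned 32-bit integer would."""
    return value & UINT32_MASK


def gbps_from_bytes_per_sec(bytes_per_sec: int) -> float:
    """Convert a byte rate to gigabits per second (decimal giga)."""
    return bytes_per_sec * 8 / 1e9


# Rates at which a byte-per-second counter would hit a 32-bit limit:
print(gbps_from_bytes_per_sec(2**31))  # signed limit, about 17.18 Gbps
print(gbps_from_bytes_per_sec(2**32))  # unsigned limit, about 34.36 Gbps

# Example of truncation: 5e9 bytes/s (40 Gbps) no longer fits in 32 bits.
print(wrap_u32(5_000_000_000))         # 705032704
```

Both limits fall inside the 15-40 Gbps window mentioned in the post, which is consistent with the guess but does not confirm it; a plain wraparound would also tend to show a wrong nonzero rate rather than exactly 0.00.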

Andrew Zonenberg

@azonenberg@ioc.exchange remote

18w

The question is, is there a way to do this efficiently that isn't just delta coding followed by running the decode on a list of transitions?

I kinda feel like it partially depends on the level of oversampling: are you expecting like one SCL edge per 5 samples, or one per 5000?
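For reference, "running the decode on a list of transitions" could look roughly like the sketch below. It is not ngscopeclient's decoder: the event format `(timestamp, scl, sda)` and the function name are hypothetical, and it only handles START, STOP, and 9-bit byte groups (8 data bits plus ACK), with no error handling.

```python
from typing import List, Tuple

# Hypothetical event format: line states immediately after a transition on either line.
Event = Tuple[int, int, int]  # (timestamp, scl, sda)


def decode_i2c_edges(events: List[Event]) -> List[tuple]:
    """Decode I2C from a time-ordered list of transitions.
    The first event only establishes the initial line states."""
    out: List[tuple] = []
    if not events:
        return out
    bits: List[int] = []
    in_frame = False
    _, prev_scl, prev_sda = events[0]
    for t, scl, sda in events[1:]:
        if prev_scl == 1 and scl == 1 and sda != prev_sda:
            # SDA moving while SCL stays high: falling = START, rising = STOP.
            in_frame = (sda == 0)
            out.append(("START" if sda == 0 else "STOP", t))
            bits = []  # drop any partial byte
        elif in_frame and prev_scl == 0 and scl == 1:
            # Data is sampled on the rising edge of SCL.
            bits.append(sda)
            if len(bits) == 9:
                value = 0
                for b in bits[:8]:
                    value = (value << 1) | b  # MSB first
                out.append(("BYTE", t, value, bits[8] == 0))  # ACK is SDA low
                bits = []
        prev_scl, prev_sda = scl, sda
    return out
```

The work here scales with the number of edges rather than the number of samples, so the payoff depends directly on the oversampling ratio raised above: at one SCL edge per 5000 samples the transition list is tiny, while at one edge per 5 samples it is not much smaller than the raw capture.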

Andrew Zonenberg

@azonenberg@ioc.exchange remote

18w

@ignaloidas GPU malloc is extremely expensive and not something you ever want to do every iteration of a shader.

We go out of our way to recycle allocations as many times as possible.
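The recycling idea can be illustrated with a simple size-keyed pool. This is a generic sketch, not ngscopeclient's actual allocator: the `BufferPool` class is hypothetical, and a host-side `bytearray` stands in for a GPU buffer.

```python
from collections import defaultdict
from typing import DefaultDict, List


class BufferPool:
    """Reuse released buffers of the same size instead of allocating new ones."""

    def __init__(self) -> None:
        self._free: DefaultDict[int, List[bytearray]] = defaultdict(list)
        self.allocations = 0  # number of real (expensive) allocations performed

    def acquire(self, size: int) -> bytearray:
        bucket = self._free[size]
        if bucket:
            return bucket.pop()   # cheap: recycle an existing buffer
        self.allocations += 1
        return bytearray(size)    # expensive path (stand-in for a GPU malloc)

    def release(self, buf: bytearray) -> None:
        self._free[len(buf)].append(buf)


pool = BufferPool()
for _ in range(1000):             # e.g. one scratch buffer per processing pass
    scratch = pool.acquire(1 << 20)
    pool.release(scratch)
print(pool.allocations)           # 1
```

Recycled buffers keep their old contents, so callers must not assume they are zeroed.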

Andrew Zonenberg

@azonenberg@ioc.exchange remote

20w

@niconiconi /me looks at "high frequency" radio and "fast" ethernet

Andrew Zonenberg

@azonenberg@ioc.exchange remote

18w

Thinking about decoding some of the more complicated protocols in ngscopeclient more fully on the GPU.

Let's take I2C, for example. I have a 2x 10M point capture coming off my STM32MP2 / Kintex-7 testbed that takes about 188 ms to decode on the Xeon 4310 on my lab workstation (lots of idle time with just a few packets).

The decode can accept either sparse or uniformly sampled data; right now it's getting uniform data at 100 Msps which is overkill but the ThunderScope doesn't yet let you decimate to go any slower.

So for the "sampling at many times the symbol rate" use case, the easiest GPU win might be to delta code uniformly sampled data and store SDA/SCL separately as a sparse waveform (i.e. sample value, start time, duration).
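A minimal CPU-side sketch of that delta coding step is shown below, turning uniformly sampled digital data into (value, start time, duration) runs. The function name and the integer `sample_period` parameter are assumptions for illustration; a GPU version would need a parallel formulation (for example, flagging transitions and compacting them with a prefix sum), which is not shown.

```python
from typing import List, Sequence, Tuple

Run = Tuple[int, int, int]  # (sample value, start time, duration)


def to_sparse(samples: Sequence[int], sample_period: int = 1) -> List[Run]:
    """Collapse consecutive identical samples into one run each.
    Times are expressed in units of `sample_period` (e.g. 10 for ns at 100 Msps)."""
    runs: List[Run] = []
    if not samples:
        return runs
    start = 0
    for i in range(1, len(samples)):
        if samples[i] != samples[start]:
            runs.append((samples[start], start * sample_period,
                         (i - start) * sample_period))
            start = i
    # Close the final run.
    runs.append((samples[start], start * sample_period,
                 (len(samples) - start) * sample_period))
    return runs


scl = [1, 1, 1, 0, 0, 1]
print(to_sparse(scl, sample_period=10))
# [(1, 0, 30), (0, 30, 20), (1, 50, 10)]
```

In the worst case (a signal that toggles on every sample) this produces one run per input sample, so the output is no smaller than the input.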

But that also involves storing and reading back from a temporary memory buffer (which will have to be as big as the waveform, since there's no way to know in advance how many I2C events there will be).

Which brings us to the second option: try to implement the entire decode inner loop in a shader.