Pegdown processor hangs when data to be parsed is html #207

Open
sunitapatro opened this issue Dec 2, 2015 · 11 comments

Comments

@sunitapatro

I am trying to read from a URL that I do not have access to, so the request is redirected to the login page. As a result, the data passed to pegDownProcessor.markdownToHtml(data) is actually HTML.

I was expecting either null or a parsing exception, but instead the call hangs at markdownToHtml(data).

Here is my code:

//url = http://localhost:8098/download/attachments/3145973/basics.text?version=1&modificationDate=1449060565788&api=v2
InputStream stream = getUrlStream(info, profileHelper, url, false, getSessionCookie(info, url));
String data = ScriptUtils.getStreamAsString(stream, info.getMacroParams().getString("encoding", ""));
PegDownProcessor pegDownProcessor = new PegDownProcessor(Extensions.ALL - (hardwraps ? 0 : Extensions.HARDWRAPS) + (allowHtml ? 0 : Extensions.SUPPRESS_ALL_HTML));
processed = pegDownProcessor.markdownToHtml(data);
log.debug("processed: {}", processed);

Any help in dealing with this is appreciated.
Thanks!!!

@vsch
Contributor

vsch commented Dec 2, 2015

@sunitapatro, I suggest you dump the data so that you can see what is returned. Then you should pre-screen this type of input to pegdown to prevent it from hanging. Please post the errant data so that pegdown can potentially be updated to prevent such hangs.

@sunitapatro
Author

Thanks for responding.

Here is the 'data' input to pegDownProcessor.markdownToHtml(data)
urldata.txt

@vsch
Contributor

vsch commented Dec 3, 2015

@sunitapatro, what I meant by data is not the markdown you expect to get but the actual HTML returned by getStreamAsString() because of the redirect to the login page.

Add code to log the data received from the stream before passing it to pegdown, so that the cause of the hang can be debugged.

You should probably add code at the same point that will detect that the data coming back is not markdown but HTML and present it as is, without pegdown processing, so that when this happens there is some feedback to the user.
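A pre-check of this kind could look roughly like the following. This is only a sketch: the class and method names are hypothetical and not part of pegdown, and the heuristic only catches a response that starts like a full HTML document (a doctype or an html tag). Markdown may legitimately contain inline HTML, so the check deliberately looks at the start of the text only.

```java
import java.util.Locale;

/** Heuristic pre-check. Names are illustrative and not part of pegdown. */
final class HtmlSniffer {
    private HtmlSniffer() {}

    /**
     * Returns true when the text starts like a full HTML document:
     * a doctype or an html tag after an optional BOM and whitespace.
     */
    static boolean looksLikeHtmlDocument(String data) {
        if (data == null) {
            return false;
        }
        int start = 0;
        // Skip a leading byte-order mark and any leading whitespace.
        while (start < data.length()
                && (data.charAt(start) == '\uFEFF'
                    || Character.isWhitespace(data.charAt(start)))) {
            start++;
        }
        // Only the first few characters are needed for this check.
        int end = Math.min(data.length(), start + 16);
        String head = data.substring(start, end).toLowerCase(Locale.ROOT);
        return head.startsWith("<!doctype html") || head.startsWith("<html");
    }
}
```

The calling code could then skip the pegdown call and show the raw response (or an error message) whenever looksLikeHtmlDocument(data) returns true. If the HTTP response headers are available, checking the Content-Type is probably a more reliable signal than inspecting the body.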

@sunitapatro
Author

The data returned by getStreamAsString() is exactly the content of "urldata.txt", which I shared earlier. I saved it as .txt only to share it here.
It's a negative test: getStreamAsString() was supposed to return a .text file (containing markdown) read from a URL, but since that URL requires authentication, it returns the login.html page instead. So urldata.txt contains nothing but HTML, as you can see.

In short, urldata.txt (which contains HTML) is the input to PegDownProcessor.

I understand that it's the wrong content for PegDownProcessor, but I was expecting PegDownProcessor to throw an exception, return null, or something like that. In reality it hangs.
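One way for calling code to get an exception instead of an indefinite wait is to run the conversion under a time limit. The sketch below uses only java.util.concurrent; the class name TimeLimitedConverter and its convert method are hypothetical, and the converter is passed in as a plain function so the example does not depend on pegdown.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Runs a conversion with a time limit. Names are illustrative only. */
final class TimeLimitedConverter {
    private TimeLimitedConverter() {}

    static String convert(Function<String, String> converter, String input, long timeoutMillis)
            throws TimeoutException {
        // Daemon thread: a stuck conversion will not keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "markdown-converter");
            thread.setDaemon(true);
            return thread;
        });
        Callable<String> task = () -> converter.apply(input);
        Future<String> result = executor.submit(task);
        try {
            // Throws TimeoutException if the conversion does not finish in time.
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for conversion", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Conversion failed", e.getCause());
        } finally {
            // Best effort: interrupts the worker, which only helps if the
            // running code actually checks for interruption.
            result.cancel(true);
            executor.shutdownNow();
        }
    }
}
```

With pegdown, the call might look like TimeLimitedConverter.convert(pegDownProcessor::markdownToHtml, data, 5000). Note that the caller is unblocked after the timeout, but a worker thread stuck in a loop that ignores interruption keeps running in the background, so this contains the symptom rather than fixing the cause.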

@vsch
Contributor

vsch commented Dec 3, 2015

"Nothing but HTML content" covers a universe of possibilities. It is impossible to guess what exactly is causing the problem in the pegdown parser without input that reproduces it. After all, pegdown is just another program, like yours; all debugging requires input to narrow down where things go wrong.

Validating input is really limited to markdown and the handled HTML tags. Handling an unadulterated HTML response from a server is outside of its intended application. I do agree that it should not hang, but without a file that causes the hang I can't begin to figure out what causes it.

It is up to the implementation-specific code to make sure that what is fed into pegdown can at least be considered markdown.

@sunitapatro
Author

@vsch,

Please read the content of urldata.txt into a stream, convert it to a String, and pass it to PegDownProcessor; you will then be able to reproduce the PegDownProcessor hang.
urldata.txt

@sunitapatro
Author

@vsch
Any update on this?

@vsch
Contributor

vsch commented Dec 7, 2015

@sunitapatro, to save time, I renamed the file to urldata.md and opened it in IDEA with my IntelliJ IDEA plugin (idea-multimarkdown), which uses pegdown as the parser. I saw no issues and no hangs.

The problem occurs when you read the file as a stream and convert it to a string; it is not a pegdown issue but an issue in the code before the pegdown call.

As a test, I suggest that before passing the string data to pegdown you convert it to a char[] via string.toCharArray(), which is what pegdown does, then dump the char array as bytes to a file and examine what content you are really passing to pegdown. The file you provided does not cause pegdown to hang, so the issue is somewhere else.
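The suggested dump could be done along these lines. This is a minimal sketch with hypothetical names (CharDump, toBytes, dump); it writes each char as two big-endian bytes so that nothing is re-encoded or normalized on the way to the file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

/** Dumps the exact chars of a string to a file. Names are illustrative only. */
final class CharDump {
    private CharDump() {}

    /** Encodes every char as two bytes, high byte first (ByteBuffer's default order). */
    static byte[] toBytes(String data) {
        char[] chars = data.toCharArray();
        ByteBuffer buffer = ByteBuffer.allocate(chars.length * 2);
        for (char c : chars) {
            buffer.putChar(c);
        }
        return buffer.array();
    }

    /** Writes the raw char data to the given file, replacing any existing content. */
    static void dump(String data, Path target) throws IOException {
        Files.write(target, toBytes(data));
    }
}
```

Calling CharDump.dump(data, path) just before the pegdown call and opening the result in a hex viewer should show whether the string contains anything unexpected, such as NUL characters or the remains of a wrongly decoded stream.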

@ZephyrJung

ZephyrJung commented Nov 23, 2016

I ran into this problem too. The version is 1.6.0.

final PegDownProcessor pegDownProcessor = new PegDownProcessor(Extensions.ALL_OPTIONALS | Extensions.ALL_WITH_OPTIONALS, 5000);
//markdownText is pure html
final RootNode node = pegDownProcessor.parseMarkdown(markdownText.toCharArray());

With code like the above, execution stops exactly at the parseMarkdown call.
Sorry for my poor English.

@ZephyrJung

[Screenshot: qq 20161124095131]
It seems that the code falls into a never-ending loop.

@ZephyrJung

ZephyrJung commented Nov 24, 2016

I solved the problem by adding extra tags, like "< html >< body >" + data + "</ body ></ html >".
data is pure HTML which contains many < li > elements.
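As a sketch, the workaround described above might be written as a small helper. The names HtmlWrapper and wrapInHtmlBody are hypothetical, and the code assumes the tags were meant without the spaces shown in the comment. Whether this avoids the loop for other inputs is not established in this thread.

```java
/** Wraps an HTML fragment in html and body tags. Names are illustrative only. */
final class HtmlWrapper {
    private HtmlWrapper() {}

    /** Assumption: the tags from the comment above, written without inner spaces. */
    static String wrapInHtmlBody(String data) {
        return "<html><body>" + data + "</body></html>";
    }
}
```

The wrapped string, for example wrapInHtmlBody(markdownText), would then be passed to the parser in place of the raw fragment.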
